Graphical interface for speech-enabled processing

ABSTRACT

Methods and devices for sampling applications using a touch input are described herein. In some embodiments, an electronic device detects a touch input, which may cause the electronic device to send identifiers to a backend system. The backend system may then determine an application and a sample audio request associated with the received identifiers. The backend system may then receive text data representing the sample audio request and text data representing a response to the sample audio request. The backend system may generate audio data representing the received text data and send the audio data to the electronic device. If the touch input is still occurring, the backend system may find and send more sample audio requests and the responses thereto. If the touch input stops occurring during the sample, the backend system may send instructions to the electronic device to stop outputting the sample.

CROSS-REFERENCE TO RELATED APPLICATION DATA

This application is a continuation of, and claims the benefit of priority of, U.S. Non-provisional patent application Ser. No. 15/198,613, entitled “GRAPHICAL INTERFACE TO PREVIEW FUNCTIONALITY AVAILABLE FOR SPEECH-ENABLED PROCESSING,” filed on Jun. 30, 2016, which is incorporated herein by reference in its entirety.

BACKGROUND

Voice activated electronic devices are becoming more prevalent in modern society. In an effort to make voice activated electronic devices more user friendly, the devices are customizable through the activation of specific functionality. Described herein are technical solutions to improve the user experience with these and other machines.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A, 1B, 1C, and 1D are illustrative diagrams of a system for using a touch input to sample a functionality of a backend system, in accordance with various embodiments;

FIG. 2 is an illustrative diagram of the architecture of the system of FIG. 1A in accordance with various embodiments;

FIGS. 3A and 3B are illustrative diagrams of a system for stopping a sample of a functionality of a backend system in accordance with various embodiments;

FIG. 4A is an illustrative flowchart of a process for using a touch input to sample a functionality of a backend system in accordance with various embodiments;

FIG. 4B is an illustrative flowchart continuing the process in FIG. 4A to receive another sample of a functionality of a backend system in accordance with various embodiments;

FIG. 5 is an illustrative flowchart of a process for stopping a sample of a functionality of a backend system in accordance with various embodiments; and

FIG. 6 is an illustrative diagram of an exemplary user interface showing multiple applications in accordance with various embodiments.

DETAILED DESCRIPTION

The present disclosure, as set forth below, is generally directed to various embodiments of methods and devices for previewing various functionalities for an electronic device in response to a touch input. An individual may, in a non-limiting embodiment, touch and hold a preview displayed on a requesting device, which may be in communication with a backend system. The backend system, for instance, may include one or more functionalities, or may be in communication with additional systems including one or more functionalities capable of providing the requesting device content and/or causing particular actions. Such functionalities may need to be enabled for a user account on the backend system prior to being used by the requesting device. In order to determine whether to add a functionality of the backend system to their user account, an individual can preview the functionality before it is enabled. In some embodiments, the individual may touch a display of their electronic device, which may have a preview of the functionality displayed thereon. For example, a sample of the functionality may be displayed within a client application on the individual's electronic device, which the individual may press for a predefined amount of time (i.e., a long press) in order to preview that functionality.

In some embodiments, a functionality of the backend system (which as used herein includes source code sometimes referred to as “skills”) may be previewed in response to touch inputs. For instance, an individual may choose to preview a particular functionality using a local client application for the backend system on their electronic device. The local client application may communicate with a web-based server, in some embodiments. This communication may allow an electronic device to perform various functions associated with the web-based server locally through the client application.

The local client application may include exemplary invocations, as well as replies to those invocations, that provide the individual with examples of how the various functionalities may be used. To sample a skill, an individual may long press (e.g., contact a display screen for longer than a predefined amount of time) on a portion of their electronic device's display that is displaying a sample invocation thereon. After the electronic device detects a particular touch input, such as a long press, on a particular location of the device's display screen that is displaying a sample invocation, the electronic device may send a first identifier associated with the functionality of the backend system to the backend system. The first identifier may allow the backend system to determine the particular functionality that the individual is selecting for preview. For instance, the individual may want to try a “Daily Jokes” functionality of the backend system before adding the functionality to their user account. In some embodiments, after receiving the first identifier, the backend system may further receive a second identifier, which may allow the backend system to identify a particular sample invocation/reply that the individual is selecting to be previewed. For instance, the individual may also select a particular example joke displayed within the sample of the “Daily Jokes” functionality displayed by the client application. As an illustrative example, the individual may touch a portion of their device's display screen having a sample invocation, “Alexa, tell daily jokes to tell me a joke,” displayed thereon. In some embodiments, the individual may perform a long press on a portion of the display screen that the sample invocation is displayed on.
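As a rough illustration of the client-side flow just described, the Python sketch below tracks a single touch and sends the two identifiers once the hold passes a long-press threshold. The two-second threshold, the TouchTracker class, and the send_identifiers callback are all illustrative assumptions; the disclosure does not specify a client implementation.

```python
import time

LONG_PRESS_SECONDS = 2.0  # exemplary threshold from the text

class TouchTracker:
    """Tracks one touch and fires a preview request on a long press."""

    def __init__(self, send_identifiers):
        self.send_identifiers = send_identifiers  # assumed callback
        self.pressed_at = None

    def touch_down(self):
        self.pressed_at = time.monotonic()

    def tick(self, skill_id, sample_audio_id):
        # Called periodically while the touch is held; once the hold
        # exceeds the threshold, the identifiers are sent to the backend.
        if (self.pressed_at is not None
                and time.monotonic() - self.pressed_at >= LONG_PRESS_SECONDS):
            self.send_identifiers(skill_id, sample_audio_id)
            self.pressed_at = None  # fire once per press
```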

In some embodiments, after the backend system determines the functionality that has been requested to be previewed, the backend system may determine first text data representing the first preview invocation. For example, if “Daily Jokes” was selected, the backend system may determine first text data associated with the “Daily Jokes” functionality. The first text data may represent the first preview invocation, and first audio data may be generated representing the first text data by performing text-to-speech processing on the first text data. Similarly, the backend system may also determine second text data representing a first preview reply (e.g., “Knock, Knock.” “Who's there?”) to the first preview invocation, and second audio data representing the second text data may also be generated by performing text-to-speech processing on the second text data. In some embodiments, the first audio data may be sent to the electronic device such that the first preview invocation is played by the electronic device. Furthermore, the second audio data may also be sent to the electronic device such that the first preview reply is played after the first preview invocation.
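A minimal sketch of this backend step follows, assuming a synthesize() helper that stands in for the TTS module (the real module is described with FIG. 2). The placeholder byte strings and voice-type names are illustrative only.

```python
# synthesize() is a stand-in for text-to-speech processing; it merely
# tags the text with the requested voice type rather than producing
# real speech.
def synthesize(text: str, voice_type: str = "default") -> bytes:
    return f"[{voice_type}] {text}".encode("utf-8")

def build_preview(invocation_text: str, reply_text: str) -> tuple:
    # First audio data: the sample invocation, in a first voice type.
    first_audio = synthesize(invocation_text, voice_type="voice_a")
    # Second audio data: the preview reply, optionally in a second voice type.
    second_audio = synthesize(reply_text, voice_type="voice_b")
    return first_audio, second_audio

first, second = build_preview(
    "Alexa, tell daily jokes to tell me a joke.",
    "Knock, knock. Who's there?",
)
```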

In some embodiments, the backend system may cause the first audio data to be played such that a specific voice type is used. Various voice types may be stored by the backend system. For instance, a first voice type may be used for the first audio data, while a second voice type may be used for the second audio data. As used herein, a voice type may refer to a predefined audio frequency range with which audio data is generated, such that the audio data, when output by an electronic device, has that predefined audio frequency range. In some embodiments, the backend system may receive an instruction from the client application indicating which voice type is to be used. These instructions may cause text-to-speech processing to generate audio data with that specific voice type. Furthermore, one or more accents or pronunciations may be employed for any voice type. For example, one voice type might employ a New York accent, such that, if used, the first preview invocation and/or first preview reply are spoken using a New York accent.

After playing the second audio data, the individual may want to hear more samples of the functionality. The individual, in some embodiments, may continue to contact the display screen (e.g., continuing the long press). This may cause additional audio data representing additional preview invocations and/or replies to be generated and sent to the electronic device to be played. For instance, three preview invocations and three preview replies may be displayed within the client application. If the individual continues to perform a long press on their electronic device, in this particular scenario, then in addition to playing the first and second audio data representing the first preview invocation and the first preview reply, the backend system may generate and send audio data representing the second preview invocation, the second preview reply, the third preview invocation, and/or the third preview reply.
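One way to picture this behavior is the loop below, where touch_is_held(), synthesize(), and send_to_device() are assumed helpers standing in for the touch-state reporting, text-to-speech, and transmission steps; none of these names come from the disclosure.

```python
# Hedged sketch: while the press persists, each remaining preview
# invocation/reply pair is synthesized and sent to the device in order.
def preview_samples(samples, touch_is_held, synthesize, send_to_device):
    for invocation_text, reply_text in samples:
        if not touch_is_held():
            break  # releasing the press ends the preview
        send_to_device(synthesize(invocation_text))
        send_to_device(synthesize(reply_text))
```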

FIGS. 1A, 1B, 1C, and 1D are illustrative diagrams of a system for using a touch input to sample a functionality of a backend system, in accordance with various embodiments. Electronic device 10, in some embodiments, may correspond to any electronic device or system. Various types of electronic devices may include, but are not limited to, desktop computers, mobile computers (e.g., laptops, ultrabooks), mobile phones, smart phones, tablets, televisions, set top boxes, smart televisions, watches, bracelets, display screens, personal digital assistants (“PDAs”), smart furniture, smart household devices, smart vehicles, smart transportation devices, and/or smart accessories. In some embodiments, electronic device 10 may be relatively simple or basic in structure such that no mechanical input option(s) (e.g., keyboard, mouse, trackpad) or touch input(s) (e.g., touchscreen, buttons) may be provided. In some embodiments, however, electronic device 10 may also correspond to a network of devices. In one exemplary, non-limiting embodiment, an individual may perform a touch input 2 on display screen 14 of electronic device 10 to select statement 18 of skill Daily Jokes 16. For example, touch input 2 may correspond to an individual pressing on display screen 14 for a certain amount of time (e.g., approximately two seconds) using one or more objects (e.g., finger(s), stylus, etc.). A touch input having a temporal duration of approximately two seconds may, in some embodiments, be referred to as a long press. However, a long press may correspond to an input that has any temporal duration. Persons having ordinary skill in the art will recognize that the use of a two-second temporal duration is merely exemplary. Alternatively, touch input 2 may correspond to any other particular type of touch input, such as, and without limitation, a tap, a swipe, a clockwise motion, a counterclockwise motion, or any other type of touch input, or any combination thereof.

Alternatively, in some embodiments, electronic device 10 may be voice activated. Electronic device 10 may receive a spoken command in order to select statement 18 of skill 16. A command may include an utterance of a wakeword (e.g., “Alexa” or “Amazon”), followed by a question. A command may correspond to a question regarding the selected skill. For example, a user may say, “Alexa—Sample skill Daily Jokes.” However, alternative or additional commands may include, but are not limited to, “Alexa—How do I use skill Daily Jokes?” or “Alexa—How does skill Daily Jokes work?”

Furthermore, in some embodiments, electronic device 10 may correspond to a manually activated electronic device. In this particular scenario, the electronic device may be activated in response to a user input, such as pressing a button, touching a display screen, waving a hand, and the like. After the user input is detected, audio may begin to be captured. In some embodiments, audio may be captured by the manually activated electronic device for a predefined amount of time, such as a few seconds. However, the manually activated electronic device may also record audio data until speech is no longer detected by one or more microphones of the manually activated electronic device.

As used herein, the term “wakeword” may also refer to any “keyword” or “key phrase,” any “activation word” or “activation words,” or any “trigger,” “trigger word,” or “trigger expression.” Persons of ordinary skill in the art will recognize that the aforementioned wakewords, “Alexa” and “Amazon,” are merely exemplary, and any word or series of words (e.g., “Hello” or “Good Morning”) may be used as a wakeword. Furthermore, the wakeword may be set or programmed by an individual, and, in some embodiments, electronic device 10 may have more than one wakeword (e.g., two or more different wakewords) that may each activate electronic device 10. Furthermore, the trigger that may be used to activate electronic device 10 may be any series of temporally related sounds.

An individual browsing graphical user interface 24 on display screen 14 of electronic device 10 may want to try out a functionality of backend system 100. For example, an option to preview a functionality of the backend system named “Daily Jokes,” which may be capable of providing the individual with a new joke each day, may be displayed so that a preview of this functionality may be provided to the individual. A functionality (which may include computer readable code sometimes referred to as a “skill”), as used herein, may correspond to a set of rules, terms, and frameworks capable of being used to update a language model associated with an individual's user account on backend system 100. Such functionalities, when enabled, may cause the language model to be updated such that additional words or phrases are recognizable and capable of being responded to. For example, if Daily Jokes 16 is enabled, words or phrases such as “jokes,” “daily jokes,” and/or “tell daily jokes to tell me a joke” may be included in an individual's language model such that subsequent utterances including those words and/or phrases are capable of being responded to using the functionality of Daily Jokes 16. Various types of exemplary functionalities may correspond to the weather, ordering a taxi, ordering a pizza, and/or hearing/telling a joke. Persons of ordinary skill will recognize that the aforementioned are merely exemplary and that other functionalities may be included in the individual's language model, such that he/she may have their user experience customized.

In some embodiments, a user may choose a specific skill because of rating 20. In an exemplary, non-limiting embodiment, rating 20 may refer to how much other users liked the skill. Rating 20, in one embodiment, may be based on a star rating. A star rating may correspond to a system in which more stars are associated with a better rating. While a star rating is described herein, persons of ordinary skill in the art will recognize that any kind of rating system or metric may be used.

After a user has selected Daily Jokes 16, such as by performing a touch input 2 on display screen 14 of electronic device 10 at a particular location where Daily Jokes is being displayed, electronic device 10 may send instructions to backend system 100 indicating that a preview invocation and/or reply for a particular functionality is to be previewed. An invocation, as used in this particular embodiment, may correspond to a portion of an utterance that is spoken after a trigger, such as a wakeword or manual input. For example, an utterance may include a wakeword (e.g., “Alexa”) that is subsequently followed by an invocation (e.g., “tell ‘Daily Jokes’ to tell me a joke”). In this example, the name “Daily Jokes” may correspond to a name associated with a particular functionality of backend system 100. In some embodiments, an invocation may not require the name of the functionality to be used. For example, the invocation might simply correspond to “tell me a joke,” or “play a joke for me.”

Display screen 14 may detect touch input in a variety of ways. For example, touch input 2 may be registered by detecting the change in resistance of current when a point on display screen 14 is touched. This may be accomplished by having two separate layers of display screen 14. Generally, the bottom layer is made of glass and the top layer may be made of a plastic film. When an individual pushes down on the film and the film makes contact with the glass, a circuit is completed. Both the glass and the film may be covered with a grid of electrical conductors. The conductors may be composed of fine metal wires, or of a thin film of transparent conductor material. In some embodiments, the conductor material may be indium tin oxide (ITO). In some embodiments, electrodes on the two layers run perpendicular to each other. For example, the conductors on the glass sheet may run in one direction and the conductors on the plastic film may run in a direction 90 degrees from the conductors on the glass sheet. To register touch input 2, an individual presses down on display screen 14. When the film is pressed down, contact is made between the grid of electrical conductors on the glass screen and the grid of electrical conductors on the plastic film, completing the circuit. When the circuit is completed, the voltage of the circuit is measured. The point on the screen may be determined based on the amount of resistance at the contact point. The voltage may then be converted by analog-to-digital converters, creating a digital signal that electronic device 10 can use as an input signal from touch input 2.
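For a concrete picture of that last step, the sketch below converts the two measured voltages into screen coordinates. The reference voltage and screen resolution are assumed values, since the disclosure does not give any.

```python
# Illustrative only: maps resistive-touch voltage measurements to pixel
# coordinates. The contact voltage divides linearly along each axis, so
# its ratio to the reference voltage gives a normalized position.
V_REF = 3.3               # assumed voltage across the conductor grid
WIDTH, HEIGHT = 480, 800  # assumed screen resolution in pixels

def adc_to_position(v_x: float, v_y: float) -> tuple:
    x = int((v_x / V_REF) * WIDTH)
    y = int((v_y / V_REF) * HEIGHT)
    return x, y

print(adc_to_position(1.65, 0.825))  # (240, 200): mid-width, upper quarter
```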

As another example, electronic device 10 may use projected capacitance. Electronic device 10 may rely on electrical capacitance. Display screen 14 may use two layers of conductors, separated by an insulator. The conductors, for example, may be made of transparent ITO. In some embodiments, conductors on the two layers run perpendicular to each other. For example, the conductors on the glass sheet may run in one direction and the conductors on the plastic film may run in a direction 90 degrees from the conductors on the glass sheet. When touch input 2 is detected, touch input 2 draws electrical charge from each of the conductive layers at the point of touch input 2. This change in charge can be measured, and from it a location of touch input 2 can be determined. Each conductor may be checked separately, making it possible to identify multiple, simultaneous points of contact on display screen 14. While only two examples of how touch input 2 can be detected by display screen 14 are described, persons of ordinary skill recognize that any suitable technique for detecting a touch input can be used, and the aforementioned are merely exemplary.
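Because each row/column intersection is measured separately, multi-touch detection reduces to comparing each reading against its baseline, roughly as in this sketch; the baseline values and the threshold are invented for illustration.

```python
# Illustrative scan of a projected-capacitance grid. A finger draws
# charge from both layers, lowering the reading at that intersection.
def find_touches(measured, baseline, threshold=5.0):
    touches = []
    for r, row in enumerate(measured):
        for c, value in enumerate(row):
            if baseline[r][c] - value > threshold:
                touches.append((r, c))
    return touches

baseline = [[100.0] * 4 for _ in range(4)]
measured = [row[:] for row in baseline]
measured[1][2] -= 12.0  # simulated first contact
measured[3][0] -= 9.0   # simulated second, simultaneous contact
print(find_touches(measured, baseline))  # [(1, 2), (3, 0)]
```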

Once touch input 2 has been registered by electronic device 10, electronic device 10 may send a request, Skill ID/Sample Audio ID 4, to backend system 100. In this particular example, Skill ID/Sample Audio ID 4 may include a first identifier and a second identifier. For example, the first identifier may correspond to a particular functionality while the second identifier may correspond to a particular example invocation and reply employing the functionality. These identifiers may be a string of characters including numbers, letters, or a combination thereof. In some embodiments, the first identifier and the second identifier may be sent separately to backend system 100; however, this is merely illustrative, as the first identifier and the second identifier may also be sent together. Upon receipt of Skill ID/Sample Audio ID 4, backend system 100 can recognize the functionality (e.g., Daily Jokes 16) and sample invocation 18 (e.g., “Alexa, tell daily jokes to tell me a joke.”) selected by the user. For example, Daily Jokes 16 may have an identifier of “001A” and sample invocation 18 may have an identifier of “001A-1.”

Skill ID/Sample Audio ID 4 may be sent to backend system 100 from electronic device 10, and may include one or more pieces of additional data, such as a time and/or date that touch input 2 was registered, a location of electronic device 10 (e.g., a GPS location), an IP address associated with electronic device 10, a type of device that electronic device 10 is, or any other information, or any combination thereof. For example, when touch input 2 is registered, electronic device 10 may obtain a GPS location of device 10 to determine a location of a user as well as a time/date (e.g., hour, minute, second, day, month, year, etc.) when touch input 2 was detected.
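Putting the identifiers and the optional metadata together, the request body might look something like the following. JSON, the field names, and the example values are all assumptions, since the disclosure leaves the wire format unspecified.

```python
import json
import time

# Illustrative shape of Skill ID/Sample Audio ID 4, using the example
# identifiers "001A" and "001A-1" from the text.
message = {
    "skillId": "001A",                         # first identifier
    "sampleAudioId": "001A-1",                 # second identifier
    "detectedAt": time.time(),                 # when touch input 2 was registered
    "location": {"lat": 47.6, "lon": -122.3},  # GPS location (invented)
    "deviceType": "tablet",                    # type of device (invented)
    "ipAddress": "203.0.113.7",                # documentation-range example
}
print(json.dumps(message))
```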

Skill ID/Sample Audio ID 4 may be sent over a network, such as the Internet, to backend system 100 using any number of communications protocols. For example, Transmission Control Protocol and Internet Protocol (“TCP/IP”) (e.g., any of the protocols used in each of the TCP/IP layers), Hypertext Transfer Protocol (“HTTP”), and wireless application protocol (“WAP”) are some of the various types of protocols that may be used to facilitate communications between electronic device 10 and backend system 100. In some embodiments, electronic device 10 and backend system 100 may communicate with one another via a web browser using HTTP. Various additional communication protocols may be used to facilitate communications between electronic device 10 and backend system 100 including, but not limited to, Wi-Fi (e.g., 802.11 protocol), Bluetooth®, radio frequency systems (e.g., 900 MHz, 1.4 GHz, and 5.6 GHz communication systems), cellular networks (e.g., GSM, AMPS, GPRS, CDMA, EV-DO, EDGE, 3GSM, DECT, IS-136/TDMA, iDen, LTE or any other suitable cellular network protocol), infrared, BitTorrent, FTP, RTP, RTSP, SSH, and/or VOIP.

Backend system 100 may include one or more servers, each in communication with one another and/or electronic device 10. Each server within backend system 100 may be associated with one or more databases or processors that are capable of storing, retrieving, processing, analyzing, and/or generating data to be provided to electronic device 10. For example, backend system 100 may include one or more sports servers for storing and processing information related to different sports (e.g., baseball, football, hockey, basketball, etc.). As another example, backend system 100 may include one or more traffic servers for storing traffic information and/or weather information and providing such information to electronic device 10. Backend system 100 may, in some embodiments, correspond to a collection of servers located within a remote facility, and individuals may store data on backend system 100 and/or communicate with backend system 100 using one or more of the aforementioned communications protocols.

Backend system 100 may also include one or more computing devices in communication with the one or more servers of backend system 100, and may include one or more processors, communication circuitry (including any circuitry capable of using any of the aforementioned communications protocols), and/or storage/memory. Backend system 100 may also include various modules that store software, hardware, logic, instructions, and/or commands for backend system 100 to perform, such as, for example, a speech-to-text (“STT”) module, a text-to-speech (“TTS”) module, a skill module, or other modules. A more detailed description of backend system 100 is provided below.

Once backend system 100 receives Skill ID/Sample Audio ID 4, backend system 100 searches for the corresponding functionality. In one embodiment, Skill ID/Sample Audio ID 4 may not be an audio file, and therefore automated speech recognition processing and/or natural language understanding processing may not be needed. In this embodiment, backend system 100 may determine that the user is attempting to sample a particular functionality corresponding to an invocation. In another embodiment, the data received by backend system 100 can include an identifier flagging the request as a sample of a particular functionality corresponding to an invocation. Furthermore, Skill ID/Sample Audio ID 4 may include data with identifiers for the particular functionality and/or sample invocation/reply selected by the user, and backend system 100 may search through the various functionalities capable of being used with backend system 100 to determine which functionality, as well as which particular sample invocation/reply, to use in response. Continuing the above example, when backend system 100 receives identifiers “001A” and “001A-1,” backend system 100 can send Skill ID/Sample Audio ID 4 to a module for determining the particular functionality and/or invocation that have been selected for previewing. After backend system 100 determines the requested functionality and/or invocation, backend system 100 may generate audio data corresponding to the preview invocation and/or the preview reply, and may send the audio data to electronic device 10. Furthermore, in some embodiments, backend system 100 may generate display data for rendering on display screen 14.
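Since no speech processing is involved, this resolution step can be pictured as a plain table lookup, as in the hedged sketch below; the catalog structure and its contents are assumptions built from the example identifiers in the text.

```python
# Illustrative lookup from identifiers to a functionality and one of
# its sample invocation/reply pairs.
CATALOG = {
    "001A": {
        "name": "Daily Jokes",
        "samples": {
            "001A-1": {
                "invocation": "Alexa, tell daily jokes to tell me a joke.",
                "reply": "Two people walk into a bar. Ouch.",
            },
        },
    },
}

def resolve(skill_id: str, sample_id: str) -> tuple:
    skill = CATALOG[skill_id]
    sample = skill["samples"][sample_id]
    return skill["name"], sample["invocation"], sample["reply"]

print(resolve("001A", "001A-1"))
```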

In some embodiments, the backend system receives text data representing statement 18 from a skill within a skills or category server. The skills or category server is described in more detail below in the description of FIG. 2. This text data is converted into an audio file by executing TTS on it. The resulting audio file is Audio file one 6 a. Audio file one 6 a is an audio representation of statement 18. A TTS module within backend system 100 is described in more detail in the description of FIG. 2.

In some embodiments, a skill module within backend system 100 searches for a response to statement 18 of the identified skill Daily Jokes 16. Once the response is identified, a text file representing the response is sent to the backend system. The text file is then converted into an audio file by executing TTS on it. The resulting audio file is Audio file two 8 a. Audio file two 8 a is an audio representation of the response to statement 18. A skill module within backend system 100 is described in more detail in the description of FIG. 2.

In some embodiments, the backend system will also receive display data. The display data may include text representing statement 18. The backend system may receive display data one 6 b. Display data one 6 b, in some embodiments, may include text that represents sample audio 18 in text form. The backend system may also receive display data two 8 b. Display data two 8 b may include text that represents a response to sample audio 18. In some embodiments, display data one 6 b may be embedded within audio file one 6 a. In some embodiments, display data two 8 b may be embedded within audio file two 8 a. Furthermore, in some embodiments, both sets of display data may be sent together.

Audio file one 6 a is then transmitted to electronic device 10. Once received by electronic device 10, audio file one 6 a (an audio representation of statement 18) is played on one or more speakers of electronic device 10. Similar to audio file one 6 a, display data one 6 b may be sent to electronic device 10. Once received by electronic device 10, electronic device 10 may display text representing statement 18 on display screen 14. Following audio file one 6 a, audio file two 8 a is transmitted to electronic device 10. Audio file two 8 a is then played by one or more speakers of electronic device 10. Similar to audio file two 8 a, display data two 8 b may be sent to electronic device 10. Once received by electronic device 10, electronic device 10 may display text representing a response to statement 18 on display screen 14. Audio file one 6 a, display data one 6 b, audio file two 8 a, and display data two 8 b may be transmitted, similarly to Skill ID/Sample Audio ID 4, over a network, such as the Internet, to electronic device 10 using any number of communications protocols. In this embodiment, the user would hear the audio invocation 12A, “Alexa, tell daily jokes to tell me a joke.” Then the user would hear the audio response 12B, “Two people walk into a bar. Ouch.”

In some embodiments, if an individual likes the sampled skill, the individual may want to enable the sampled skill. To enable the sampled skill, an individual may select Enable Skill 22 on graphical user interface 24. As an example, an individual having a user account on backend system 100 may have the skill entitled “Daily Jokes” enabled. In some embodiments, enabling the skill may include providing the backend system, more particularly the natural language understanding (NLU) module 260, with one or more additional rules. The rules that are included with NLU module 260 for the skill may cause certain invocations, if detected by NLU module 260, to be serviced using that skill. For example, if the skill that is enabled is the “Daily Jokes” skill, then invocations that are related to, or directed towards, the “Daily Jokes” skill may cause that skill to perform one or more actions, such as providing response information to the invocation. As an illustrative example, if the skill to be enabled is “Daily Jokes,” then NLU module 260, for the particular user account with which the enablement request was associated, may be provided with a rule that, for invocations of the form, “Alexa—tell daily jokes to tell me a joke,” NLU module 260 is to call the “Daily Jokes” skill to obtain information. The backend system and NLU are described in more detail below in the description of FIG. 2.
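The following sketch models that per-account rule registration in miniature; the NLURules class and its methods are assumptions used to make the idea concrete, not the actual structure of NLU module 260.

```python
# Illustrative per-account routing table: enabling a skill registers
# invocation phrases that should be serviced by that skill.
class NLURules:
    def __init__(self):
        self.rules = {}  # account id -> {invocation phrase: skill name}

    def enable_skill(self, account_id, skill_name, phrases):
        account_rules = self.rules.setdefault(account_id, {})
        for phrase in phrases:
            account_rules[phrase] = skill_name

    def route(self, account_id, invocation):
        # Returns the skill registered for this invocation, if any.
        return self.rules.get(account_id, {}).get(invocation)

nlu = NLURules()
nlu.enable_skill("user-123", "Daily Jokes",
                 ["tell daily jokes to tell me a joke"])
print(nlu.route("user-123", "tell daily jokes to tell me a joke"))
```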

FIG. 1B is a representation of display data one 6 b and display data two 8 b being received and displayed by electronic device 10. Once display data one 6 b is received by electronic device 10, first text invocation 12C may be displayed on display screen 14 of electronic device 10. Similar to display data one 6 b, display data two 8 b may also cause electronic device 10 to display first text response 12D on display screen 14. The display may happen before, simultaneously with, or after audio invocation 12A and audio response 12B. In some embodiments, the display of electronic device 10 may be altered such that graphical user interface 24 is out of focus behind first text invocation 12C and first text response 12D. In some embodiments, first text invocation 12C and first text response 12D may be the only items displayed on display screen 14. In some embodiments, Daily Jokes 16 may be redisplayed at the top of display screen 14.

If an individual wanted to continue to sample the selected skill, the individual may continue touch input 2. FIG. 1C shows an individual continuing touch input 2 after audio response 12B was played by electronic device 10. After continuing touch input 2, electronic device 10 may send Touch Input Data 26 to backend system 100. Touch Input Data 26, in some embodiments, may be an indicator that lets the backend system know that touch input 2 is continuing. Touch Input Data 26 may be sent (e.g., transmitted) over a network, such as the Internet, to backend system 100 using any number of communications protocols. For example, Transmission Control Protocol and Internet Protocol (“TCP/IP”) (e.g., any of the protocols used in each of the TCP/IP layers), Hypertext Transfer Protocol (“HTTP”), and wireless application protocol (“WAP”) are some of the various types of protocols that may be used to facilitate communications between electronic device 10 and backend system 100. In some embodiments, electronic device 10 and backend system 100 may communicate with one another via a web browser using HTTP. Various additional communication protocols may be used to facilitate communications between electronic device 10 and backend system 100 including, but not limited to, Wi-Fi (e.g., 802.11 protocol), Bluetooth®, radio frequency systems (e.g., 900 MHz, 1.4 GHz, and 5.6 GHz communication systems), cellular networks (e.g., GSM, AMPS, GPRS, CDMA, EV-DO, EDGE, 3GSM, DECT, IS-136/TDMA, iDen, LTE or any other suitable cellular network protocol), infrared, BitTorrent, FTP, RTP, RTSP, SSH, and/or VOIP.

After receiving Touch Input Data 26, the backend system may search for more samples within Daily Jokes 16. This process is explained in more detail in the description of FIG. 4B. Once the backend system determines that additional samples are available, in some embodiments, the backend system receives text data representing a second statement from a skill within a skills or category server. The skills or category server is described in more detail below in the description of FIG. 2. In one embodiment, the second statement may be “Alexa, tell daily jokes to tell me another joke.” This text file is converted into an audio file by executing TTS on the text file. The resulting audio file is audio file three 28 a. Audio file three 28 a is an audio representation of the second statement. A TTS module within backend system 100 is described in more detail in the description of FIG. 2.

In some embodiments, the backend system will also receive display data. The display data may include text representing the second statement. The backend system may receive display data three 28 b. Display data three 28 b, in some embodiments, may include text that represents the second statement in text form. The backend system may also receive display data four 30 b. Display data four 30 b may include text that represents a response to the second sample audio. In some embodiments, display data three 28 b may be embedded within audio file three 28 a. In some embodiments, display data four 30 b may be embedded within audio file four 30 a. Furthermore, in some embodiments, both sets of display data may be sent together.

In some embodiments, a skill module within backend system 100 searches for a response to the second statement of the identified skill Daily Jokes 16. In some embodiments, this response may be “What does a nosey pepper do? Get jalapeno business.” Once the response is identified, a text file representing the response is sent to the backend system. The text file is then converted into an audio file by executing TTS on the text file. The resulting audio file is audio file four 30 a. Audio file four 30 a is an audio representation of the response to the second statement. A skill module within backend system 100 is described in more detail in the description of FIG. 2.

Audio file three 28 a is then transmitted to electronic device 10. Once received by electronic device 10, audio file three 28 a is played on one or more speakers of electronic device 10. Similar to audio file three 28 a, display data three 28 b may be sent to electronic device 10. Once received by electronic device 10, electronic device 10 may display text representing the second statement on display screen 14. Following audio file three 28 a, audio file four 30 a is transmitted to electronic device 10. Audio file four 30 a is then played by one or more speakers of electronic device 10. Similar to audio file four 30 a, display data four 30 b may be sent to electronic device 10. Once received by electronic device 10, electronic device 10 may display text representing a response to the second statement on display screen 14. In this embodiment, the user would hear the audio invocation 32A, “Alexa, tell daily jokes to tell me another joke.” Then the user would hear the audio response 32B, “What does a nosey pepper do? Get jalapeno business.”

In some embodiments, touch input 2 may continue past the playing of audio file four 30 a. If this happens, the backend system may continue to transmit audio files sampling the selected skill. This may continue until there are no examples left. In some embodiments, after the backend system has run out of examples, the backend system may start over and transmit Audio file one 6 a.
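A compact way to express this wraparound is Python's itertools.cycle, as in the hedged sketch below; touch_is_held() and play() are assumed callbacks for the touch state and audio output.

```python
from itertools import cycle

# cycle() restarts from the first sample once the list is exhausted,
# matching the "start over" behavior described above.
def play_until_released(samples, touch_is_held, play):
    for invocation_audio, reply_audio in cycle(samples):
        if not touch_is_held():
            break  # the press ended; stop the preview
        play(invocation_audio)
        play(reply_audio)
```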

FIG. 1D is a representation of display data three 28 b and display data four 30 b being received and displayed by electronic device 10. Once display data three 28 b is received by electronic device 10, second text invocation 32C may be displayed on display screen 14 of electronic device 10. Similar to display data three 28 b, display data four 30 b may also cause electronic device 10 to display second text response 32D on display screen 14. The display may happen before, simultaneously with, or after audio invocation 32A and audio response 32B. In some embodiments, second text invocation 32C may be displayed below first text response 12D. The continued display may be in response to the continued touch input. In some embodiments, the display of electronic device 10 may be altered such that graphical user interface 24 is out of focus behind second text invocation 32C and second text response 32D. In some embodiments, second text invocation 32C and second text response 32D may be the only items displayed on display screen 14. In some embodiments, Daily Jokes 16 may be redisplayed at the top of display screen 14.

FIG. 2 is an illustrative diagram of the architecture of the system of FIG. 1A in accordance with various embodiments. Electronic device 10, in some embodiments, may correspond to any type of electronic device capable of receiving a touch input. Electronic device 10 may, in some embodiments, also be capable of receiving voice commands after detecting a specific sound (e.g., a wakeword or trigger), recognizing commands (e.g., audio commands, inputs) within captured audio, and performing one or more actions in response to the received commands. Various types of electronic devices may include, but are not limited to, desktop computers, mobile computers (e.g., laptops, ultrabooks), mobile phones, smart phones, tablets, televisions, set top boxes, smart televisions, watches, bracelets, display screens, personal digital assistants (“PDAs”), smart furniture, smart household devices, smart vehicles, smart transportation devices, and/or smart accessories. In some embodiments, however, electronic device 10 may also correspond to a network of devices.

Electronic device 10 may include one or more processors 202, storage/memory 204, communications circuitry 206, one or more microphones 208 or other audio input devices (e.g., transducers), one or more speakers 210 or other audio output devices, as well as an optional input/output (“I/O”) interface 212. However, one or more additional components may be included within electronic device 10, and/or one or more components may be omitted. For example, electronic device 10 may include a power supply or a bus connector. As another example, electronic device 10 may not include an I/O interface. Furthermore, while multiple instances of one or more components may be included within electronic device 10, for simplicity only one of each component has been shown.

Processor(s) 202 may include any suitable processing circuitry capable of controlling operations and functionality of electronic device 10, as well as facilitating communications between various components within electronic device 10. In some embodiments, processor(s) 202 may include a central processing unit (“CPU”), a graphics processing unit (“GPU”), one or more microprocessors, a digital signal processor, or any other type of processor, or any combination thereof. In some embodiments, the functionality of processor(s) 202 may be performed by one or more hardware logic components including, but not limited to, field-programmable gate arrays (“FPGA”), application-specific integrated circuits (“ASICs”), application-specific standard products (“ASSPs”), system-on-chip systems (“SOCs”), and/or complex programmable logic devices (“CPLDs”). Furthermore, each of processor(s) 202 may include its own local memory, which may store program modules, program data, and/or one or more operating systems. Processor(s) 202 may run an operating system (“OS”) for electronic device 10, and/or one or more firmware applications, media applications, and/or applications resident thereon.

Storage/memory 204 may include one or more types of storage mediums such as any volatile or non-volatile memory, or any removable or non-removable memory implemented in any suitable manner to store data on electronic device 10. For example, information may be stored using computer-readable instructions, data structures, and/or program modules. Various types of storage/memory may include, but are not limited to, hard drives, solid state drives, flash memory, permanent memory (e.g., ROM), electronically erasable programmable read-only memory (“EEPROM”), CD-ROM, digital versatile disk (“DVD”) or other optical storage medium, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, RAID storage systems, or any other storage type, or any combination thereof. Furthermore, storage/memory 204 may be implemented as computer-readable storage media (“CRSM”), which may be any available physical media accessible by processor(s) 202 to execute one or more instructions stored within storage/memory 204. In some embodiments, one or more applications (e.g., gaming, music, video, calendars, lists, etc.) may be run by processor(s) 202, and may be stored in memory 204.

In some embodiments, storage/memory 204 may include one or more modules and/or databases, such as speech recognition module 214, list of wakewords database 216, wakeword detection module 218, and adaptive echo cancellation module 220. Speech recognition module 214 may, for example, include an automatic speech recognition (“ASR”) component that recognizes human speech in detected audio. Speech recognition module 214 may also include a natural language understanding (“NLU”) component that determines user intent based on the detected audio. Also included within speech recognition module 214 may be a text-to-speech (“TTS”) component capable of converting text to speech to be outputted by speaker(s) 210, and/or a speech-to-text (“STT”) component capable of converting received audio signals into text to be sent to backend system 100 for processing.

List of wakewords database 216 may be a database stored locally on electronic device 10 that includes a list of a current wakeword for electronic device 10, as well as one or more previously used, or alternative, wakewords for the voice activated electronic device. In some embodiments, an individual may set or program a wakeword for electronic device 10. The wakeword may be programmed directly on electronic device 10, or a wakeword or words may be set by the individual via a backend system application that is in communication with backend system 100. For example, an individual may use their mobile device having the backend system application running thereon to set the wakeword. The specific wakeword may then be communicated from the mobile device to backend system 100, which in turn may send/notify electronic device 10 of the individual's selection for the wakeword. The selected wakeword may then be stored in database 216 of storage/memory 204.

Wakeword detection module 218 may include an expression detector that analyzes an audio signal produced by microphone(s) 208 to detect a wakeword, which generally may be a predefined word, phrase, or any other sound, or any series of temporally related sounds. Such an expression detector may be implemented using keyword spotting technology, as an example. A keyword spotter is a functional component or algorithm that evaluates an audio signal to detect the presence of a predefined word or expression within the audio signal detected by microphone(s) 208. Rather than producing a transcription of words of the speech, a keyword spotter generates a true/false output (e.g., a logical 1/0) to indicate whether or not the predefined word or expression was represented in the audio signal. In some embodiments, an expression detector may be configured to analyze the audio signal to produce a score indicating a likelihood that the wakeword is represented within the audio signal detected by microphone(s) 208. The expression detector may then compare that score to a threshold to determine whether the wakeword will be declared as having been spoken.
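That score-versus-threshold decision can be sketched as follows, with score_wakeword() standing in for the actual detector and the threshold value invented for illustration.

```python
WAKEWORD_THRESHOLD = 0.85  # assumed detection threshold

def score_wakeword(audio_frame: bytes) -> float:
    # Stand-in for the keyword spotter's scoring; a real detector would
    # model the audio signal rather than return a constant.
    return 0.0

def wakeword_detected(audio_frame: bytes) -> bool:
    # A true/false output rather than a transcription, as described above.
    return score_wakeword(audio_frame) >= WAKEWORD_THRESHOLD
```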

In some embodiments, a keyword spotter may use simplified ASR techniques. For example, an expression detector may use a Hidden Markov Model (“HMM”) recognizer that performs acoustic modeling of the audio signal and compares the HMM model of the audio signal to one or more reference HMM models that have been created by training for specific trigger expressions. An HMM model represents a word as a series of states. Generally, a portion of an audio signal is analyzed by comparing its HMM model to an HMM model of the trigger expression, yielding a feature score that represents the similarity of the audio signal model to the trigger expression model.

In practice, an HMM recognizer may produce multiple feature scores, corresponding to different features of the HMM models. An expression detector may use a support vector machine (“SVM”) classifier that receives the one or more feature scores produced by the HMM recognizer. The SVM classifier produces a confidence score indicating the likelihood that an audio signal contains the trigger expression. The confidence score is compared to a confidence threshold to make a final decision regarding whether a particular portion of the audio signal represents an utterance of the trigger expression (e.g., wakeword). Upon declaring that the audio signal represents an utterance of the trigger expression, electronic device 10 may then begin transmitting the audio signal to backend system 100 for detecting and responding to subsequent utterances made by an individual.
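The two-stage decision reads roughly as follows in code; the linear weights stand in for a trained SVM, and both the weights and the confidence threshold are purely illustrative.

```python
CONFIDENCE_THRESHOLD = 0.7  # assumed

def svm_confidence(feature_scores, weights, bias=0.0):
    # A trained SVM reduces per-feature scores to one margin; a weighted
    # sum models that reduction here.
    return sum(w * s for w, s in zip(weights, feature_scores)) + bias

def is_trigger(feature_scores):
    weights = [0.5, 0.3, 0.2]  # illustrative trained weights
    return svm_confidence(feature_scores, weights) >= CONFIDENCE_THRESHOLD

print(is_trigger([0.9, 0.8, 0.6]))  # True: 0.45 + 0.24 + 0.12 = 0.81 >= 0.7
```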

Adaptive echo cancellation module 220 may include one or more adaptive echo cancellation filters that filter acoustic echo audio signals from received audio signals. The adaptive echo cancellation filters may automatically adapt based on the acoustic environment in and around electronic device 10, based on audio received by electronic device 10. In some embodiments, adaptive echo cancellation module 220 may be configured to enable and disable adaptive echo cancellation for selected time periods. During time periods when adaptation is disabled, adaptive echo cancellation module 220 may not update the adaptive echo cancellation filters based on any audio signals received by electronic device 10; however, adaptive echo cancellation module 220 may continue to filter acoustic echo signals from the incoming audio data.
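The enable/disable split, where filtering always runs but coefficient updates happen only while adaptation is on, can be sketched with a toy one-tap filter; a real echo canceller would use a multi-tap filter bank, and the LMS step size here is an invented value.

```python
class AdaptiveEchoCanceller:
    def __init__(self, step: float = 0.01):
        self.coeff = 0.0   # echo-path estimate (a single tap, for illustration)
        self.step = step
        self.adapt = True  # adaptation may be disabled for selected periods

    def process(self, mic_sample: float, playback_sample: float) -> float:
        echo_estimate = self.coeff * playback_sample
        error = mic_sample - echo_estimate  # echo is filtered either way
        if self.adapt:
            # LMS-style update, applied only while adaptation is enabled.
            self.coeff += self.step * error * playback_sample
        return error
```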

Communications circuitry 206 may include any circuitry allowing or enabling electronic device 10 to communicate with one or more devices, servers, and/or systems. For example, communications circuitry 206 may facilitate communications between electronic device 10 and backend system 100. Communications circuitry 206 may use any communications protocol, such as any of the previously mentioned exemplary communications protocols. In some embodiments, electronic device 10 may include an antenna to facilitate wireless communications with a network using various wireless technologies (e.g., Wi-Fi, Bluetooth®, radiofrequency, etc.). In yet another embodiment, electronic device 10 may include one or more universal serial bus (“USB”) ports, one or more Ethernet or broadband ports, and/or any other type of hardwire access port so that communications circuitry 206 allows electronic device 10 to communicate with one or more communications networks.

Electronic device 10 may also include one or more microphones 208 and/or transducers. Microphone(s) 208 may be any suitable component capable of detecting audio signals. For example, microphone(s) 208 may include one or more sensors for generating electrical signals and circuitry capable of processing the generated electrical signals. In some embodiments, microphone(s) 208 may include multiple microphones capable of detecting various frequency levels. As an illustrative example, electronic device 10 may include multiple microphones (e.g., four, seven, ten, etc.) placed at various positions about electronic device 10 to monitor/capture any audio outputted in the environment where electronic device 10 is located. The various microphones 208 may include some microphones optimized for distant sounds, while some microphones may be optimized for sounds occurring within a close range of electronic device 10.

Electronic device 10 may further include one or more speakers 210. Speaker(s) 210 may correspond to any suitable mechanism for outputting audio signals. For example, speaker(s) 210 may include one or more speaker units, transducers, arrays of speakers, and/or arrays of transducers that may be capable of broadcasting audio signals and/or audio content to a surrounding area where electronic device 10 may be located. In some embodiments, speaker(s) 210 may include headphones or ear buds, which may be wirelessly connected, or hard-wired, to electronic device 10, and which may be capable of broadcasting audio directly to an individual.

In some embodiments, electronic device 10 may be hard-wired, or wirelessly connected, to one or more speakers 210. For example, electronic device 10 may cause one or more speakers 210 to output audio thereon. In this particular scenario, electronic device 10 may receive audio to be output by speakers 210, and electronic device 10 may send the audio to speakers 210 using one or more communications protocols. For instance, electronic device 10 and speaker(s) 210 may communicate with one another using a Bluetooth® connection, or another near-field communications protocol. In some embodiments, electronic device 10 may communicate with speaker(s) 210 indirectly. For example, electronic device 10 may communicate with backend system 100, and backend system 100 may communicate with speaker(s) 210. In this particular example, electronic device 10 may send audio data representing a command to play audio using speaker(s) 210 to backend system 100, and backend system 100 may send the audio to speaker(s) 210 such that speaker(s) 210 may play the audio thereon.

In some embodiments, one or more microphones 208 may serve as input devices to receive audio inputs, such as speech from an individual. Electronic device 10, in the previously mentioned embodiment, may then also include one or more speakers 210 to output audible responses. In this manner, electronic device 10 may function solely through speech or audio, without the use or need for any input mechanisms or displays.

In one exemplary embodiment, electronic device 10 includes I/O interface 212. The input portion of I/O interface 212 may correspond to any suitable mechanism for receiving inputs from a user of electronic device 10. In some embodiments, I/O interface 212 may include a display screen and/or touch screen, which may be any size and/or shape and may be located at any portion of electronic device 10. Various types of displays may include, but are not limited to, liquid crystal displays (“LCD”), monochrome displays, color graphics adapter (“CGA”) displays, enhanced graphics adapter (“EGA”) displays, variable graphics array (“VGA”) displays, or any other type of display, or any combination thereof. Still further, a touch screen may, in some embodiments, correspond to a display screen including capacitive sensing panels capable of recognizing touch inputs thereon. Additionally, for example, a camera, keyboard, mouse, joystick, or external controller may be used as an input mechanism for I/O interface 212. The output portion of I/O interface 212 may correspond to any suitable mechanism for generating outputs from electronic device 10. For example, one or more displays may be used as an output mechanism for I/O interface 212. As another example, one or more lights, light emitting diodes (“LEDs”), or other visual indicator(s) may be used to output signals via I/O interface 212 of electronic device 10. In some embodiments, one or more vibrating mechanisms or other haptic features may be included with I/O interface 212 to provide a haptic response to touch input 2 on electronic device 10.

Backend system 100, as mentioned previously, may, in some embodiments, be in communication with electronic device 10. Backend system 100 includes various components and modules including, but not limited to, automatic speech recognition (“ASR”) module 258, natural language understanding (“NLU”) module 260, skills module 262, and text-to-speech (“TTS”) module 264. A speech-to-text (“STT”) module may be included in ASR module 258. In some embodiments, backend system 100 may also include computer readable media, including, but not limited to, flash memory, random access memory (“RAM”), and/or read-only memory (“ROM”). Backend system 100 may also include various modules that store software, hardware, logic, instructions, and/or commands for backend system 100, such as a speaker identification (“ID”) module, a user profile module, or any other module, or any combination thereof.

ASR module 258 may be configured such that it recognizes human speech in detected audio, such as audio captured by electronic device 10. ASR module 258 may also be configured to determine an end time of speech included within the received audio data, such as an end time of a spoken question. ASR module 258 may include, in one embodiment, one or more processor(s) 252, storage/memory 254, and communications circuitry 256. Processor(s) 252, storage/memory 254, and communications circuitry 256 may, in some embodiments, be substantially similar to processor(s) 202, storage/memory 204, and communications circuitry 206, which are described in greater detail above, and the aforementioned descriptions of the latter may apply. NLU module 260 may be configured such that it determines user intent based on the detected audio received from electronic device 10. NLU module 260 may include processor(s) 252, storage/memory 254, and communications circuitry 256. In some embodiments, ASR module 258 may include a speech-to-text (“STT”) module 266. STT module 266 may employ various speech-to-text techniques. However, techniques for transcribing speech into text are well known in the art and need not be described in further detail herein, and any suitable computer implemented speech to text technique may be used to convert the received audio signal(s) into text, such as SOFTSOUND® speech processing technologies available from the Autonomy Corporation, which is headquartered in Cambridge, England, United Kingdom.

Skills module 262 may, for example, correspond to various action-specific skills or servers capable of processing various task-specific actions. Skills module 262 may further correspond to first party skills and/or third party skills operable to perform different tasks or actions. For example, based on the context of audio received from electronic device 10, backend system 100 may use a certain skill to retrieve or generate a response, which in turn may be communicated back to electronic device 10. Skills module 262 may include processor(s) 252, storage/memory 254, and communications circuitry 256. As an illustrative example, skills 262 may correspond to one or more game servers for storing and processing information related to different games (e.g., “Simon Says,” karaoke, etc.). As another example, skills 262 may include one or more weather servers for storing weather information and/or providing weather information to electronic device 10.

TTS module 264 may employ various text-to-speech techniques. Techniques for transcribing text into speech are well known in the art and need not be described in further detail herein; any suitable computer implemented text to speech technique may be used to convert received text into synthesized speech, such as SOFTSOUND® speech processing technologies available from the Autonomy Corporation, which is headquartered in Cambridge, England, United Kingdom. TTS module 264 may also include processor(s) 252, storage/memory 254, and communications circuitry 256.

Persons of ordinary skill in the art will recognize that although each of ASR module 258, NLU module 260, skills module 262, and TTS module 264 include instances of processor(s) 252, storage/memory 254, and communications circuitry 256, those instances of processor(s) 252, storage/memory 254, and communications circuitry 256 within each of ASR module 258, NLU module 260, skills module 262, and TTS module 264 may differ. For example, the structure, function, and style of processor(s) 252 within ASR module 258 may be substantially similar to the structure, function, and style of processor(s) 252 within NLU module 260; however, the actual processor(s) 252 need not be the same entity.

FIGS. 3A and 3B are illustrative diagrams of a system for stopping a sample of a functionality of a backend system in accordance with various embodiments. FIGS. 3A and 3B may be similar to FIGS. 1A and 1B, and the same descriptions apply. In some embodiments, an individual may make a touch input 2 on display screen 14 of electronic device 10 to select statement 18 of skill Daily Jokes 16. In one exemplary embodiment, touch input 2 is pressing on display screen 14 for a temporal duration of approximately two seconds.

Once Daily Jokes 16 has been selected, a user can sample Daily Jokes 16 by making a touch input 2 on display screen 14 of electronic device 10 to select statement 18. An invocation, as used in this particular embodiment, refers to a command that is for the purpose of calling a skill. In this example, as in FIGS. 1A and 1B, the command includes the name of the skill (e.g., “Daily Jokes”). In some embodiments, an invocation may not require the name of the skill being called. For example, the invocation might simply state “Alexa, tell me a joke,” or “Alexa, play a joke for me.”

Once touch input 2 has been registered on electronic device 10, electronic device 10 sends a request, Skill ID/Sample Audio ID 4, to backend system 100. In this particular example, Skill ID/Sample Audio ID 4 sent to backend system 100 includes a skill identification number and a sample audio identification number. After receiving Skill ID/Sample Audio ID 4, backend system 100 can recognize the skill (e.g., Daily Jokes 16) and statement 18 (e.g., “Alexa, tell daily jokes to tell me a joke.”) selected by the user.
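
A rough sketch of what such a request might look like follows. The JSON field names are assumptions for illustration, since the disclosure specifies only that a skill identification number and a sample audio identification number are sent.

```python
import json

def build_sample_request(skill_id: str, sample_audio_id: str) -> str:
    """Package the two identifiers the backend needs to locate the skill
    (e.g., Daily Jokes 16) and the selected statement (e.g., statement 18)."""
    return json.dumps({
        "skill_id": skill_id,               # skill identification number
        "sample_audio_id": sample_audio_id, # sample audio identification number
    })

print(build_sample_request("001A", "001A-1"))
```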

Once backend system 100 receives Skill ID/Sample Audio ID 4, backend system 100 searches for the corresponding skill and invocation. Once backend system 100 finds the requested skill and invocation, and recognizes the request is for a sample, backend system 100 prepares two files that are to be sent to electronic device 10. Because the backend system recognizes the Skill ID/Sample Audio ID 4, the backend system may also recognize that electronic device 10 is requesting that both statement 18 and a response to statement 18 be played by electronic device 10. The backend system may receive a first text file from a category or skills module. In this embodiment, the first file is a text file representing statement 18 selected by touch input 2 on display screen 14 of user device 10. The text file representing statement 18 may be received from a Daily Jokes category server.

In some embodiments, the backend system receives text data representing statement 18 from a skills or category server. The skills or category server is described in more detail below in the description of FIG. 2. This text file is converted into an audio file by executing TTS on the text file. The resulting audio file is audio file one 6 a. Audio file one 6 a is an audio representation of statement 18. A TTS module within backend system 100 is described in more detail in the description of FIG. 2.

In some embodiments, the backend system will also receive display data. The display data may include text representing statement 18. The backend system may receive display data one 6 b. Display data one 6 b, in some embodiments, may include text that represents sample audio 18 in text form. The backend system may also receive display data two 8 b. Display data two 8 b may include text that represents a response to sample audio 18. In some embodiments, display data one 6 b may be embedded within audio file one 6 a. In some embodiments, display data two 8 b may be embedded within audio file two 8 a. Furthermore, in some embodiments, both sets of display data may be sent together.

In some embodiments, a skill module within backend system 100 searches for a response to statement 18 of the identified skill Daily Jokes 16. Once the response is identified, a text file representing the response is sent to the backend system. The text file is then converted into an audio file by executing TTS on the text file. The resulting audio file is audio file two 8 a. Audio file two 8 a is an audio representation of the response to statement 18. A skill module within backend system 100 is described in more detail in the description of FIG. 2. Audio file one 6 a is then transmitted to electronic device 10. Once received by electronic device 10, audio file one 6 a (an audio representation of statement 18) is played on one or more speakers of electronic device 10. In this embodiment, the user would hear the audio invocation 12A, “Alexa, tell daily jokes to tell me a joke.”
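
The two-file flow above can be summarized in a short sketch, assuming stand-in synthesize() and send() helpers; the joke response text is a placeholder, since the disclosure does not give one.

```python
# Minimal sketch of the two-file sample flow, assuming stand-in helpers.

def synthesize(text: str) -> bytes:
    """Placeholder TTS: a real engine would return synthesized speech."""
    return text.encode("utf-8")  # stand-in bytes, not real audio

def send(device_id: str, payload: bytes) -> None:
    """Stand-in for transmitting a file to electronic device 10."""
    print(f"-> {device_id}: {payload[:45]!r}")

# Audio file one 6a: the invocation (statement 18).
audio_file_one = synthesize("Alexa, tell daily jokes to tell me a joke.")
# Audio file two 8a: the response (placeholder joke text).
audio_file_two = synthesize("Here is your daily joke ...")

send("device-10", audio_file_one)  # played first, as audio invocation 12A
send("device-10", audio_file_two)  # then the response, as audio response 12B
```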

While hearing, or after hearing, audio invocation 12A, an individual may want to stop the sample. In FIG. 3B, touch input 2 on electronic device 10 stops. In some embodiments, once touch input 2 stops, electronic device 10 may transmit Data Indicating No Touch Input 302 to backend system 100. Backend system 100 may determine that the lack of touch input 2 should result in the audio stopping. Backend system 100 may generate stop instructions 304. In some embodiments, stop instructions 304 may cause electronic device 10 to stop playing audio data it received from backend system 100. Furthermore, stop instructions 304 may cause electronic device 10 to stop all audio output on electronic device 10.

Stop instructions 304 may then be transmitted to electronic device 10. In some embodiments, this will cause electronic device 10 to stop playing audio invocation 12A. In some embodiments, if audio response 12B has been transmitted to electronic device 10, stop instructions 304 may cause electronic device 10 to stop playing audio response 12B.
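
A minimal sketch of this stop-on-release behavior follows, assuming a boolean touch report and an invented instruction payload.

```python
from typing import Optional

def handle_touch_update(touch_active: bool) -> Optional[dict]:
    """Return stop instructions when the device reports the touch ended."""
    if not touch_active:
        # Data Indicating No Touch Input 302 arrived: halt audio output.
        return {"instruction": "STOP_AUDIO_OUTPUT"}
    return None  # touch still held; let the sample keep playing

print(handle_touch_update(False))  # -> {'instruction': 'STOP_AUDIO_OUTPUT'}
```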

FIG. 4A is an illustrative flowchart of a process for using a touch input to sample a functionality of a backend system in accordance with various embodiments. Application, as used in process 400, refers to a functionality of the backend system. Persons of ordinary skill will recognize that steps within process 400 may be omitted or arranged in a different order. Process 400, in some embodiments, may begin at step 402. At step 402, the backend system receives a first identifier. The backend system of process 400 may be similar to backend system 100 and the same description applies. The first identifier sent from an electronic device to the backend system in step 402 may indicate a skill or application that an individual has selected. As used herein, skill or application may be similar to Daily Jokes 16 and the same description applies. For example, an individual may want to sample a news application entitled “News.” News, as used herein, is merely exemplary, and persons having ordinary skill would recognize that any number of skills may be used. The first identifier, in some embodiments, may include a skill identification number, and might be a string of characters including numbers, letters, or a combination thereof. For example, the first identifier for the application News may be “001A.” The first identifier may be similar to Skill ID/Sample Audio ID 4 and the same description applies. The electronic device may be similar to electronic device 10 and the same description applies.

In some embodiments, the backend system determines the first identifier was sent in response to a touch input. The backend system can receive many types of data, including, but not limited to, audio data, text data, identifiers, time and/or date data, a location of electronic device 10 (e.g., a GPS location), an IP address associated with electronic device 10, the type of device that electronic device 10 is, or any other information, or any combination of information. In some embodiments, if the backend system receives an identifier indicating a selected skill, the backend system may recognize that the identifier was sent in response to a touch input. The touch input may be similar to touch input 2 and the same description applies.

In some embodiments, the first identifier may be sent in response to a sample request from a computer, laptop, or desktop. In these embodiments, a sample request may come in response to a double click or any other predefined manner of requesting a sample. Persons of ordinary skill in the art recognize that any method of requesting a preview of a function may be suitable.

At step 404, the backend system receives a second identifier. The second identifier may indicate a sample audio request. Continuing the News application example, the sample selected by the user may have a specific statement. For example, the sample may be “Alexa, tell News to tell me what is going on today.” The statement may be similar to statement 18 and the same description applies. The second identifier may, in some embodiments, contain a sample audio identification number. This identifier may be a string of characters including numbers, letters, or a combination thereof. For example, “Alexa, tell News to tell me what is going on today,” may have an identifier “001A-1.” The second identifier may be similar to Skill ID/Sample Audio ID 4 and the same description applies. Step 404 may be omitted in some embodiments.

At step 406, the backend system determines the first identifier is associated with an application. After receiving the first identifier, the backend system may then try to match the identifier to a specific application. The backend system may use a skills server to determine which skill was selected. The skills server used herein may be similar to Category Servers/Skills 262 and the same description applies. The skills server may have a list of skills, each skill within the list having a predetermined identifier. The backend system would then compare each skill identifier to a list of predetermined skills. In some embodiments, the backend system may receive identifiers that might match the first identifier from a variety of skills within a skills server. Once a match is found, the backend system determines a skill associated with the first identifier. For example, the backend system may receive an identifier from a Joke application server. The Joke application may have an identifier of “002A.” The backend system may receive an identifier from a Traffic application server. The Traffic application may have an identifier of “001B.” The backend system may receive an identifier from the News application server. The News application might have an identifier of “001A.” If the News application was selected and the identifier received was “001A,” the backend system may determine that the application selected was the News application.
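
The matching in step 406 amounts to a lookup against a table of predetermined identifiers. A sketch follows, with the table contents taken from the examples above; the helper name is an assumption.

```python
from typing import Optional

# Predetermined skill identifiers, mirroring the examples in the text.
SKILLS = {"001A": "News", "001B": "Traffic", "002A": "Joke"}

def match_skill(first_identifier: str) -> Optional[str]:
    """Return the skill whose predetermined identifier matches, if any."""
    return SKILLS.get(first_identifier)

print(match_skill("001A"))  # -> News
print(match_skill("999Z"))  # -> None (would trigger the apology message)
```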

If there are no matches found, the backend system may receive text data representing an apology message stating “Sorry, no message was found.” This text may be transmitted to the electronic device where the text may pop up on a display screen of the electronic device. The display screen may be similar to display screen 14, and the same description applies. In some embodiments, the apology message may be converted into audio data representing the apology message by executing TTS on the text data. This audio data may be sent to the electronic device such that the apology message is played by one or more speakers of the electronic device. The one or more speakers, as described herein, may be similar to speaker(s) 210 and the same description applies.

At step 408, the backend system determines the second identifier is associated with a first sample request within the application. Once a skill or application has been matched to the first identifier, the backend system may then search for a statement within the matched skill or application. In some embodiments, the backend system may receive identifiers that might match the second identifier from a variety of sample audios within the matched skill server. Once a match is found, the backend system determines a sample request within the application associated with the second identifier. For example, in some embodiments, there may be three samples stored within the News application. The backend system may receive a first sample identifier. The first sample may be “Alexa, tell News to tell me what is going on today.” The first sample may have an identifier of “001A-1.” The second sample may be “Alexa, ask News what is the news today.” The second sample may have an identifier of “001A-2.” The third sample may be “Alexa, ask News what went on yesterday.” The third sample may have an identifier of “001A-3.” If the second identifier is “001A-1,” the backend system might determine the match is the first sample. Persons of ordinary skill recognize that while only three sample requests are described, any number of sample requests may be stored. Step 408 may be omitted in some embodiments.

At step 410, the backend system determines a first response that is responsive to the first sample request. After determining the first sample request, the backend system may then find a response to the sample request. This response may be an actual response. An actual response, as used herein, may refer to how the skill or application being sampled would actually respond to a request. For example, if an individual was sampling a “What is today” application on Jun. 6, 2016, the actual response to a sample request might be “Today is Monday, Jun. 6, 2016.” The backend system may determine this response by receiving the response from the category server. In some embodiments, the backend system may determine the correct response by receiving a plurality of responses and determining which of the plurality of responses is correct. In some embodiments, the NLU might receive confidence scores from the skill server representing responses to the sample audio. A confidence score is a representation of how sure a skill is that its response is correct. The NLU may sift through the confidence scores and choose the highest one. In some embodiments, the skill server may only send one response with a high confidence score, indicating that the skill server knows the response. The NLU, as described herein, may be similar to NLU 260 and the same description applies. For example, in response to the sample request “Alexa, ask News what is the news today,” the News application server may send a confidence score along with the response “The mayor spoke today.” In some embodiments, the response to the sample request may be stored with the sample request. Step 410 may be omitted in some embodiments. In some embodiments, text data representing the response to the first sample request may be stored in the backend system in text form. In some embodiments, audio data representing the response may be stored locally.
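
The confidence-based selection can be expressed compactly; the scores and response texts below are invented for illustration.

```python
# Sketch of confidence-based selection: keep the candidate response whose
# confidence score is highest. Scores and texts are invented examples.

candidates = [
    {"response": "The mayor spoke today.", "confidence": 0.94},
    {"response": "No news is available.", "confidence": 0.31},
]

best = max(candidates, key=lambda c: c["confidence"])
print(best["response"])  # -> The mayor spoke today.
```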

At step 412, the backend system determines first audio data representing the first sample request will be sent to an electronic device. Generally speaking, the backend system may receive audio data, determine a response, and send the responsive audio data. In process 400, the backend system receives an identifier, indicating that a sample has been selected. Because a sample has been selected, the backend system may determine that the audio that would generally cause the backend system to find a response will instead be played by the electronic device. This may allow an individual to learn and experience how a specific skill or application works. For example, the backend system may determine that the sample audio “Alexa, tell News to tell me what is going on today,” will be output by the electronic device. Step 412 may be omitted in some embodiments.

In some embodiments, the backend system may determine that the statement will be sent to a second electronic device. The second electronic device may be, but is not limited to, a voice activated electronic device. In order to determine that audio files should be sent to the second electronic device, the backend system may need to receive a customer identification number associated with the electronic device. The backend system may then find a user account associated with the customer identification number. Once the user account is located, the backend system may search for electronic devices associated with the user account. The backend system may find that the second electronic device is associated with the user account. In some embodiments, the backend system may determine that the second electronic device may receive the statement.
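
A sketch of this account lookup follows, with invented customer and device identifiers and an assumed table structure.

```python
from typing import Optional

# Invented account table: customer ID -> devices registered to the account.
ACCOUNTS = {"cust-123": {"devices": ["tablet-10", "voice-device-20"]}}

def find_second_device(customer_id: str, requesting_device: str) -> Optional[str]:
    """Locate another device on the same user account, if one exists."""
    account = ACCOUNTS.get(customer_id)
    if account is None:
        return None
    others = [d for d in account["devices"] if d != requesting_device]
    return others[0] if others else None

print(find_second_device("cust-123", "tablet-10"))  # -> voice-device-20
```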

At step 414, the backend system determines second audio data representing the first response will be sent to the electronic device. After determining the first audio data representing the first sample request will be sent to an electronic device, the backend system may determine that the response will also be sent to the electronic device. This would allow an individual sampling an application to hear the sample request and a response to that sample request. For example, the backend system may determine that “The mayor spoke today,” will be sent to and output by the electronic device after the electronic device outputs, “Alexa, tell News to tell me what is going on today.” In some embodiments, as with step 412, the backend system may determine that the response might be sent to a second electronic device. Furthermore, in some embodiments, the backend system may determine that the sample request may be sent to the electronic device and the response may be sent to the second electronic device. Step 414 may be omitted in some embodiments.

At step 416, the backend system receives first text data representing the first sample request. In some embodiments, the text data received by the backend system will come from an application within a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. In some embodiments, the first text data may represent the sample request. For example, the backend system may receive first text data from the News application server. The first text data may represent the sample audio “Alexa, tell News to tell me what is going on today.”

In some embodiments, the backend system may also receive instructions from the skill server. The instructions may indicate that the sample audio is to be played in a specific voice type. In some embodiments, voice types may differ in frequency spectrums. For example, one voice type may be in a frequency spectrum ranging from 1000 Hz to 2000 Hz. A second voice type may be in a frequency spectrum ranging from 2100 Hz to 2800 Hz. While only two different frequency spectrum ranges are given, persons having ordinary skill in the art will recognize that any frequency range may be used and only two were used for exemplary purposes. Additionally, in some embodiments, voice types may have different tones. In some embodiments, voice types may have different tones and frequency spectrums. In a non-limiting, exemplary embodiment, to enhance the experience of a sample, it may be preferable to play the sample audio request in one voice and the response in another voice. In this embodiment, the skill server may send along instructions that indicate the sample audio is to be output using a certain voice.

At step 418, the backend system generates first audio data representing the first text data. Once the first text data has been received from a category server or a skills server, the first text data may be converted into audio data. The data is converted into audio data by executing TTS functionality on the first text data. The TTS functionality may be similar to Text-To-Speech 264 of FIG. 2, and the same description applies. Continuing the News application example, if the text data received from the News application server represents the sample audio, the audio data may represent the following statement, “Alexa, tell News to tell me what is going on today.” In some embodiments, if the backend system has received instructions indicating that the sample audio is to be output using a specific voice type, the text data may be converted into audio data representing a statement in the specific voice type. For example, the audio data representing “Alexa, tell News to tell me what is going on today,” might be cued to play in a New York accent.
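
Where voice-type instructions accompany the text data, the TTS step might thread them through as a parameter. In the sketch below, the voice table mirrors the exemplary frequency ranges given earlier, and synthesize() remains a stub rather than a real TTS engine.

```python
# Invented voice-type table mirroring the exemplary frequency ranges above.
VOICE_TYPES = {
    "voice_a": {"freq_range_hz": (1000, 2000)},
    "voice_b": {"freq_range_hz": (2100, 2800)},
}

def synthesize(text: str, voice_type: str = "voice_a") -> bytes:
    """Placeholder TTS that records which voice profile was requested."""
    profile = VOICE_TYPES[voice_type]
    tag = f"[{voice_type} {profile['freq_range_hz']}] "
    return (tag + text).encode("utf-8")  # stand-in bytes, not real audio

audio = synthesize("Alexa, tell News to tell me what is going on today.",
                   voice_type="voice_b")
print(audio[:30])
```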

At step 420, the backend system receives second text data representing the first response. In some embodiments, the second text data received by the backend system will come from an application within a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. In some embodiments, the second text data may represent the response to the sample request. For example, the backend system may receive second text data from the News application server. The second text data may represent the response to the sample audio, “The mayor is talking today.”

In some embodiments, the backend system may also receive instructions from the skill server. The instructions may indicate that the response to the sample audio is to be played in a specific voice type. In a non-limiting, exemplary embodiment, to enhance the experience of a sample, it may be preferable to play the sample audio request in one voice and the response in another voice. In this embodiment, the skill server may send along instructions that indicate the response is to be output using a certain voice.

At step 422, the backend system generates the second audio data representing the second text data. Once the second text data has been received from a category server or a skills server, the second text data may be converted into audio data. The data is converted into audio data by executing TTS functionality on the second text data. The TTS functionality may be similar to Text-To-Speech 264 of FIG. 2, and the same description applies. Continuing the News application example, if the text data received from the News application server contains the response to the sample audio, the audio data may represent the following statement, “The mayor is talking today.” In some embodiments, if the backend system has received instructions indicating that the response to the sample audio is to be output using a specific voice type, the text data may be converted into audio data representing a statement in the specific voice type. For example, the audio data representing “The mayor is talking today,” might be cued to play in a New York accent.

In some embodiments, the first audio data and the second audio data may be played using different voice types. Additionally, the first audio data and the second audio data may be played using the same voice types. In some embodiments, the backend system may receive instructions to play the first audio data in a different voice than the second audio data. In this embodiment, it may only be necessary to send instructions regarding the first audio data.

At step 424, the backend system receives first display data. In some embodiments, the first display data received by the backend system will come from an application within a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. First display data, in some embodiments, may include text representing the first sample request. For example, the first display data may include the text “Alexa, tell News to tell me what is going on today.” In some embodiments, the display data is stored locally. In some embodiments, step 424 may be omitted.

At step 426, the backend system receives second display data. In some embodiments, the second display data received by the backend system will come from an application within a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. Second display data, in some embodiments, may include text representing the first response. For example, the second display data may include the text “The mayor is talking today.” In some embodiments, the display data is stored locally. In some embodiments, step 426 may be omitted.

At step 428, the backend system sends the first audio data to the electronic device. The first audio data, created using TTS functionality on the first text data, is transmitted to the electronic device. Step 428 may be similar to the transmission of audio file one 6 a and the same description applies. After the first audio data is sent to the electronic device, the first audio data is output by one or more speakers of the electronic device. For example, continuing the News application example, the backend system may send the first audio data to the electronic device such that the electronic device plays “Alexa, tell News to tell me what is going on today.” In some embodiments, the electronic device may play the first audio data in a specific voice type. For example, the electronic device may play “Alexa, tell News to tell me what is going on today,” in a New York accent. Additionally, in some embodiments, the first audio data may be sent to a second electronic device such that “Alexa, tell News to tell me what is going on today,” is played by the second electronic device.

At step 430, the backend system sends the first display data to the electronic device. Step 430 may be similar to the transmission of display data one 6 b and the same description applies. After the first display data is sent to the electronic device, the first display data may be displayed by a display screen of the electronic device. The display screen in process 400 may be similar to display screen 14 and the same description applies. First display data may be displayed in a similar manner to first text invocation 12C, and the same description applies. In some embodiments, the display data is stored locally. In these embodiments, the first audio data may trigger a response in the electronic device that causes the electronic device to display the words being output by the electronic device. In some embodiments, the words being displayed may be highlighted as the audio data is being output by the electronic device. For example, when the word “Alexa” is output by the electronic device, the word “Alexa” being displayed on the electronic device may be highlighted. In some embodiments, step 430 may be omitted.
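
The word-by-word highlighting could be driven by pacing the display against the audio output. The following sketch substitutes a fixed per-word delay for real audio timing, and the bracket rendering is an illustrative stand-in for actual highlighting.

```python
import time

def play_with_highlight(words, seconds_per_word=0.3):
    """Reprint the sentence once per word, bracketing the word being 'spoken'."""
    for i in range(len(words)):
        rendered = " ".join(f"[{w}]" if j == i else w
                            for j, w in enumerate(words))
        print(rendered)               # the highlight advances with the audio
        time.sleep(seconds_per_word)  # stand-in for real audio pacing

play_with_highlight("Alexa, tell News to tell me what is going on today.".split())
```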

At step 432, the backend system sends the second audio data to the electronic device. The second audio data, created using TTS functionality on the second text data, is transmitted to the electronic device. Step 432 may be similar to the transmission of audio file two 8 a and the same description applies. After the second audio data is sent to the electronic device, the second audio data is output by one or more speakers of the electronic device. For example, continuing the News application example, the backend system may send the second audio data to the electronic device such that the electronic device plays “The mayor is talking today.” In some embodiments, the electronic device may play the second audio data in a specific voice type. For example, the electronic device may play “The mayor is talking today,” in a New York accent. Additionally, in some embodiments, the second audio data may be sent to a second electronic device such that “The mayor is talking today,” is played by the second electronic device.

In some embodiments, once the first audio data and the second audio data have been sent, the backend system may receive text data representing a message. This message may be received from a skills/category server. In some embodiments, this message may represent confirmation that the sample has been played. For example, the backend system may receive text representing the following message, “Your sample has been played.” The text data would then be converted to audio data by executing TTS on the text data. The audio data representing the confirmation message would then be sent to an electronic device such that the electronic device outputs the confirmation message on one or more speakers of the electronic device. In some embodiments, this message may represent instructions on enablement of the application that has been sampled. For example, the backend system may receive text representing the following message, “You can enable the News skill by selecting enable on your device.” The text data would then be converted to audio data by executing TTS on the text data. The audio data representing the instructions would then be sent to an electronic device such that the electronic device outputs the instructions on one or more speakers of the electronic device. In some embodiments, this message may ask an individual if he or she would like to enable the application that has been sampled. For example, the backend system may receive text representing the following message, “Would you like to enable the News skill?” The text data would then be converted to audio data by executing TTS on the text data. The audio data representing the question would then be sent to an electronic device such that the electronic device outputs the question on one or more speakers of the electronic device. As used herein, TTS may be similar to TTS 264 of FIG. 2 and the same description applies.

At step 434, the backend system sends second display data to the electronic device. Step 434 may be similar to the transmission of display data two 8 b and the same description applies. After the second display data is sent to the electronic device, the second display data may be displayed by a display screen of the electronic device. The display screen in process 400 may be similar to display screen 14 and the same description applies. Second display data may be displayed in a similar manner to first text response 12D, and the same description applies. In some embodiments, the display data is stored locally. In these embodiments, the second audio data may trigger a response in the electronic device that causes the electronic device to display the words being output by the electronic device. In some embodiments, the words being displayed may be highlighted as the audio data is being output by the electronic device. For example, when the word “mayor” is output by the electronic device, the word “mayor” being displayed on the electronic device may be highlighted. In some embodiments, step 434 may be omitted.

FIG. 4B is an illustrative flowchart continuing the process in FIG. 4A to receive another sample of a functionality of a backend system in accordance with various embodiments. Process 400 may continue, in some embodiments, with step 436. At step 436, the backend system receives data from the electronic device. This data, sent from the electronic device, may indicate to the backend system that the electronic device is still detecting a touch input. This data may be sent just after the second audio data is output on the electronic device. The data may be similar to Touch Input Data 26 and the same description applies. For example, after the electronic device outputs “The mayor is talking today,” the electronic device may notice that a touch input is still being detected. If the electronic device continues to detect the touch input, the electronic device may send data indicating the touch input is still occurring to the backend system.

In some embodiments, the backend system determines that the electronic device is still detecting the touch input. Once the backend system receives the data from the electronic device, the backend system may recognize that the touch input is still occurring. This may be similar to the disclosure regarding Touch Input Data 26 and the same description applies. For example, after the electronic device outputs “The mayor is talking today,” the electronic device sends data indicating the touch input is still occurring to the backend system. Once the data is received by the backend system, the backend system may determine that the touch input is still occurring.
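
Steps 436 through 440 amount to a loop that keeps fetching samples while the touch input persists. A compact sketch follows, with an invented two-sample store standing in for the category server.

```python
# Invented two-sample store standing in for the category server.
SAMPLES = iter([
    "Alexa, tell News to tell me what is going on today.",
    "Alexa, ask News what was the news yesterday.",
])

def next_sample():
    """Category server stand-in: next available sample, or None."""
    return next(SAMPLES, None)

def run(touch_checks):
    """Play samples while each check reports the touch is still held."""
    for still_touching in touch_checks:
        if not still_touching:
            return  # touch released: stop sampling (see FIG. 5)
        sample = next_sample()
        if sample is None:
            print("There are no more samples.")
            return
        print("playing:", sample)

run([True, True, True])  # two samples play, then the no-more-samples message
```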

At step 438, the backend system sends a request to a category server. After determining the touch input is still occurring, the backend system may recognize that the individual requesting the sample would like to hear another sample. In order to meet that request for another sample, the backend system may need to send a request to a category/skills server in order to determine whether another sample is available. As used herein, the category/skills server may be similar to, or within, Category Servers/Skills 262 and the same description applies. For example, continuing the News application example, the backend system may send a request to the News application server, asking whether another response is available. In some embodiments, the sample audio and responses are stored locally. In some embodiments, step 438 may be omitted.

At step 440, the backend system receives a response to the request from the category server. Once the request has been sent to the category server, the category server may respond. The response received by the backend system may indicate whether another sample is available. If the category server response indicates that there is another sample, the process continues. If the category server response indicates that there are no more samples, the process would end here. If the process ends, the backend system may receive text data representing a message from the category server. The message may state “Sorry, there are no more samples.” The text data would then be converted into audio data by executing TTS on the text data. The audio data would then be sent to the electronic device such that the electronic device outputs the message on one or more speakers of the electronic device. In some embodiments, the sample audio and responses are stored locally. In some embodiments, step 440 may be omitted.

At step 442, the backend system determines a second sample request is available. Once the backend system receives the response, it may determine what the response indicates. If the response is a positive response, the backend system may determine a second sample request is available. In some embodiments, the response may be a negative response. In this embodiment, the backend system may determine that a second response is not available. In some embodiments, the backend system may determine that a second sample request is available by searching text data representing stored sample requests. In some embodiments, the backend system may determine that a second response is also available. In some embodiments, step 442 may be omitted.

At step 444, the backend system determines a second response that is responsive to the second sample request. Step 444 may be similar to step 410 and the same description applies. After determining a second sample request is available, the backend system may then find a response to the second sample request. The backend system may determine this response by receiving the response from the category server. For example, in response to the sample request “Alexa, ask News what was the news yesterday,” the News application server may send a confidence score along with the response “The mayor spoke yesterday.” In some embodiments, the response to the sample request may be stored with the sample request.

At step 446, the backend system determines that third audio data representing the second sample request will be sent to the electronic device. Step 446 may be similar to step 412 and the same description applies. Because a second sample has been requested, the backend system may determine that the second sample will be played by the electronic device. This may allow an individual to learn and experience how a specific skill or application works. For example, the backend system may determine that the sample audio “Alexa, tell News to tell me what was the news yesterday,” will be output by the electronic device. In some embodiments, step 446 may be omitted.

In some embodiments, the backend system may determine that the second statement will be sent to a second electronic device. The second electronic device may be, but is not limited to, a voice activated electronic device. In order to determine that audio files should be sent to the second electronic device, the backend system may need to receive a customer identification number associated with the electronic device. The backend system may then find a user account associated with the customer identification number. Once the user account is located, the backend system may search for electronic devices associated with the user account. The backend system may find that the second electronic device is associated with the user account. In some embodiments, the backend system may determine that the second electronic device may receive the second statement.

At step 448, the backend system determines that fourth audio data representing the second response will be sent to the electronic device. Step 448 may be similar to step 414 and the same description applies. After determining the third audio data representing the second sample request will be sent to an electronic device, the backend system may determine that the response will also be sent to the electronic device. This would allow an individual sampling an application to hear the sample request and a response to that sample request. For example, the backend system may determine that “The mayor spoke yesterday,” will be sent to and output by the electronic device after the electronic device outputs, “Alexa, tell News to tell me what was the news yesterday.” In some embodiments, the backend system may determine that the response might be sent to a second electronic device. In some embodiments, step 448 may be omitted.

At step 450, the backend system receives third text data representing the second sample request. Step 450 may be similar to step 416 and the same description applies. In some embodiments, the text data received by the backend system will come from a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. In some embodiments, the third text data may represent the second sample request. For example, the backend system may receive third text data from the News application server. The third text data may represent the second sample audio “Alexa, tell News to tell me what was the news yesterday.”

In some embodiments, the backend system may also receive instructions from the skill server. The instructions may indicate that the sample audio is to be played in a specific voice type. In some embodiments, voice types may differ in frequency spectrums. For example, one voice type may be in a frequency spectrum ranging from 1000 Hz to 2000 Hz. A second voice type may be in a frequency spectrum ranging from 2100 Hz to 2800 Hz. While only two different frequency spectrum ranges are given, persons having ordinary skill in the art will recognize that any frequency range may be used and only two were used for exemplary purposes. Additionally, in some embodiments, voice types may have different tones. In some embodiments, voice types may have different tones and frequency spectrums. In a non-limiting, exemplary embodiment, to enhance the experience of a sample, it may be preferable to play the sample audio request in one voice and the response in another voice. In this embodiment, the skill server may send along instructions that indicate the sample audio is to be output using a certain voice.

At step 452, the backend system generates the third audio data representing the third text data. Step 452 may be similar to step 418 and the same description applies. Once the third text data has been received from a category server or a skills server, the third text data may be converted into audio data. The data is converted into audio data by executing TTS functionality on the third text data. The TTS functionality may be similar to Text-To-Speech 264 of FIG. 2, and the same description applies. Continuing the News application example, if the text data received from the News application server represents the second sample audio, the audio data may represent the following statement, “Alexa, tell News to tell me what the news was yesterday.” In some embodiments, if the backend system has received instructions indicating that the sample audio is to be output using a specific voice type, the text data may be converted into audio data representing a statement in the specific voice type. For example, the audio data representing “Alexa, tell News to tell me what the news was yesterday,” might be cued to play in a New York accent.

At step 454, the backend system receives fourth text data representing the second response. Step 454 may be similar to step 420 and the same description applies. In some embodiments, the fourth text data received by the backend system will come from a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. In some embodiments, the fourth text data may represent the response to the second sample request. For example, the backend system may receive fourth text data from the News application server. The fourth text data may represent the response to the sample audio, “The mayor talked yesterday.”

At step 456, the backend system generates the fourth audio data representing the fourth text data. Step 456 may be similar to step 422 and the same description applies. Once the fourth text data has been received from a category server or a skills server, the fourth text data may be converted into audio data. The text data is converted into audio data by executing TTS functionality on the fourth text data. The TTS functionality may be similar to Text-To-Speech 264 of FIG. 2, and the same description applies. Continuing the News application example, if the text data received from the News application server contains the response to the second sample audio, the audio data may represent the following statement, “The mayor talked yesterday.” In some embodiments, if the backend system has received instructions indicating that the response to the sample audio is to be output using a specific voice type, the text data may be converted into audio data representing a statement in the specific voice type. For example, the audio data representing “The mayor talked yesterday,” might be cued to play in a New York accent.

At step 458, the backend system receives third display data. In some embodiments, the third display data received by the backend system will come from an application within a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. Third display data, in some embodiments, may include text representing the second sample request. For example, the third display data may include the text “Alexa, tell News to tell me what the news was yesterday.” In some embodiments, the display data may be stored locally on the electronic device. In some embodiments, step 458 may be omitted.

At step 460, the backend system receives fourth display data. In some embodiments, the fourth display data received by the backend system will come from an application within a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. Fourth display data, in some embodiments, may include text representing the second response. For example, the fourth display data may include the text “The mayor talked yesterday.” In some embodiments, the display data may be stored locally on the electronic device. In some embodiments, step 460 may be omitted.

At step 462, the backend system sends the third audio data to the electronic device. Step 462 may be similar to step 428 and the same description applies. The third audio data, created using TTS functionality on the third text data, is transmitted to the electronic device. Step 462 may also be similar to the transmission of audio file three 28 a and the same description applies. After the third audio data is sent to the electronic device, the third audio data is output by one or more speakers of the electronic device. For example, the backend system may send the third audio data to the electronic device such that the electronic device plays “Alexa, tell News to tell me what the news was yesterday.” In some embodiments, the electronic device may play the third audio data in a specific voice type. For example, the electronic device may play “Alexa, tell News to tell me what the news was yesterday,” in a New York accent. Additionally, in some embodiments, the third audio data may be sent to a second electronic device such that “Alexa, tell News to tell me what the news was yesterday,” is played by the second electronic device.

At step 464, the backend system sends the third display data to the electronic device. Step 464 may be similar to the transmission of display data three 28 b and the same description applies. After the third display data is sent to the electronic device, the third display data may be displayed by a display screen of the electronic device. The display screen in process 400 may be similar to display screen 14 and the same description applies. Third display data may be displayed in a similar manner to second text invocation 32C, and the same description applies. In some embodiments, the display data is stored locally. In these embodiments, the third audio data may trigger a response in the electronic device that causes the electronic device to display the words being output by the electronic device. In some embodiments, the words being displayed may be highlighted as the audio data is being output by the electronic device. For example, when the word “Alexa” is output by the electronic device, the word “Alexa” being displayed on the electronic device may be highlighted. In some embodiments, step 464 may be omitted.

At step 466, the backend system sends the fourth audio data to the electronic device. Step 466 may be similar to step 432 and the same description applies. The fourth audio data, created using TTS functionality on the fourth text data, is transmitted to the electronic device. Step 466 may also be similar to the transmission of audio file four 30 a and the same description applies. After the fourth audio data is sent to the electronic device, the fourth audio data is output by one or more speakers of the electronic device. For example, the backend system may send the fourth audio data to the electronic device such that the electronic device plays “The mayor talked yesterday.” In some embodiments, the electronic device may play the fourth audio data in a specific voice type. For example, the electronic device may play “The mayor talked yesterday,” in a New York accent. Additionally, in some embodiments, the fourth audio data may be sent to a second electronic device such that “The mayor talked yesterday,” is played by the second electronic device.

At step 468, the backend system sends fourth display data to the electronic device. Step 468 may be similar to the transmission of display data four 30 b and the same description applies. After the fourth display data is sent to the electronic device, the fourth display data may be displayed by a display screen of the electronic device. The display screen in process 400 may be similar to display screen 14 and the same description applies. Fourth display data may be displayed in a similar manner to second text response 32D, and the same description applies. In some embodiments, the display data is stored locally. In these embodiments, the fourth audio data may trigger a response in the electronic device that causes the electronic device to display the words being output by the electronic device. In some embodiments, the words being displayed may be highlighted as the audio data is being output by the electronic device. For example, when the word “mayor” is output by the electronic device, the word “mayor” being displayed on the electronic device may be highlighted. In some embodiments, step 468 may be omitted.

In some embodiments, the electronic device might continue to detect the touch input. If this is the case, additional data may be sent to the backend system and the process in FIG. 4B would start again. In some embodiments, there might be no more samples. If this is the case, the backend system may receive text data representing a message. This message may be received from a skills/category server. In some embodiments, this message may represent notice that there are no more samples to be played. For example, the backend system may receive text representing the following message, “There are no more samples.” The text data would then be converted to audio data by executing TTS on the text data. The audio data representing the message would then be sent to an electronic device such that the electronic device outputs the message on one or more speakers of the electronic device.

FIG. 5 is an illustrative flowchart of a process for stopping a sample of a functionality of a backend system in accordance with various embodiments. Application, as used in process 500, refers to a functionality of the backend system. The backend system of process 500 may be similar to backend system 100 and the same description applies. Persons of ordinary skill will recognize that some steps in process 500 may be omitted or rearranged. Process 500 may, in some embodiments, begin at step 502. At step 502, the backend system receives a first identifier. The first identifier sent from an electronic device to the backend system in step 502 may indicate a skill or application that an individual has selected. Step 502 may be similar to step 402 and the same description applies. For example, an individual may want to sample a news application entitled “News.” The first identifier, in some embodiments, may include a skill identification number, and might be a string of characters including numbers, letters, or a combination thereof. For example, the first identifier for the application News may be “002A.” The first identifier may be similar to Skill ID/Sample Audio ID 4 and the same description applies. The electronic device may be similar to electronic device 10 and the same description applies.

In some embodiments, the backend system determines the first identifier was sent in response to a touch input. In some embodiments, when the backend system receives an identifier indicating a selected skill, the backend system may recognize that the identifier was sent in response to a touch input. The touch input may be similar to touch input 2 and the same description applies.

At step 504, the backend system receives a second identifier. Step 504 may be similar to step 404 and the same description applies. The second identifier may indicate a sample audio request within the selected skill. For example, the sample selected by the user may have a specific statement, such as “Alexa, tell News to tell me what is going on today.” The statement may be similar to statement 18 and the same description applies. The second identifier may, in some embodiments, contain a sample audio identification number. This identifier may be a string of characters including numbers, letters, or a combination thereof. For example, “Alexa, tell News to tell me what is going on today,” may have an identifier “002A-1.” The second identifier may be similar to Skill ID/Sample Audio ID 4 and the same description applies. In some embodiments, step 504 may be omitted.

At step 506, the backend system determines the first identifier is associated with an application. Step 506 may be similar to step 406 and the same description applies. After receiving the first identifier, the backend system may then try to match the identifier to a specific application. The backend system may use a skills server to determine which skill was selected. The skills server used herein may be similar to Category Servers/Skills 262 and the same description applies. The skills server may have a list of skills, each skill within the list having a predetermined identifier. The backend system may then compare each skill identifier to a list of skills stored in the skills/category server.

At step 508, the backend system determines the second identifier is associated with a first sample request within the application. Step 508 may be similar to step 408 and the same description applies. Once a skill or application has been matched to the first identifier, the backend system may then search for an invocation within the matched skill or application. In some embodiments, the match is sent to the backend system. Once a match is found, the backend system determines a sample request within the application associated with the second identifier. In some embodiments, step 508 may be omitted.

At step 510, the backend system determines a first response that is responsive to the first sample request. Step 510 may be similar to step 410 and the same description applies. Once the backend system determines the first sample request, the backend system may then determine a response to the first sample request. The backend system may determine this response by receiving the response from the category server. In some embodiments, the backend system may determine the correct response by receiving a plurality of responses and determining which of the plurality of responses is correct. In some embodiments, the response to the sample request may be stored with the sample request. In some embodiments, step 510 may be omitted.

At step 512, the backend system determines first audio data representing the first sample request will be sent to an electronic device. Step 512 may be similar to step 412 and the same description applies. In process 500, the backend system receives an identifier, indicating that a sample has been selected. Because a sample has been selected, the backend system may determine that the audio that would generally cause the backend system to find a response will instead be played by the electronic device. This may allow an individual to learn and experience how a specific skill or application works. For example, the backend system may determine that the sample audio “Alexa, tell News to tell me what is going on today,” will be output by the electronic device. In some embodiments, step 512 may be omitted.

At step 514, the backend system determines second audio data representing the first response will be sent to the electronic device. Step 514 may be similar to step 414 and the same description applies. Once the backend system has determined the first audio data will be sent to the electronic device, the backend system may also determine that second audio data representing a response will be sent to the electronic device. This would allow an individual sampling an application to hear the sample request and a response to that sample request. For example, the backend system may determine that “The mayor spoke today,” will be sent to and output by the electronic device after the electronic device outputs, “Alexa, tell News to tell me what is going on today.” In some embodiments, as with step 512, the backend system may determine that the response might be sent to a second electronic device. Furthermore, in some embodiments, the backend system may determine that the sample request may be sent to the electronic device and the response may be sent to the second electronic device. In some embodiments, step 514 may be omitted.

At step 516, the backend system receives first text data representing the first sample request. Step 516 may be similar to step 416 and the same description applies. In some embodiments, the text data received by the backend system will be sent from a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. In some embodiments, the first text data may represent the first sample request. For example, the backend system may receive first text data from the News application server. The first text data may represent the sample audio “Alexa, tell News to tell me what is going on today.”

In some embodiments, the backend system may also receive instructions from the skill server. The instructions may indicate that the sample audio is to be played in a specific voice type. In a non-limiting, exemplary embodiment, to enhance the experience of a sample, it may be preferable to play the sample audio request in one voice and the response in another voice. In this embodiment, the skill server may send along instructions that indicate the sample audio is to be output using a certain voice.

At step 518, the backend system generates first audio data representing the first text data. Step 518 may be similar to step 418 and the same description applies. Once the first text data has been received from a category server or a skills server, the first text data may be converted into audio data. The data is converted into audio data by executing TTS functionality on the first text data. The TTS functionality may be similar to Text-To-Speech 264 of FIG. 2, and the same description applies. For example, if the text data received from a News application server contains the sample audio, the audio data may represent the following statement, “Alexa, tell News to tell me what is going on today.” In some embodiments, if the backend system has received instructions indicating that the sample audio is to be output using a specific voice type, the text data may be converted into audio data representing a statement in the specific voice type. For example, the audio data representing “Alexa, tell News to tell me what is going on today,” might be cued to play in a New York accent.

At step 520, the backend system receives second text data representing the first response. Step 520 may be similar to step 420 and the same description applies. In some embodiments, the second text data received by the backend system will be sent from a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. In some embodiments, the second text data may represent the response to the first sample request. For example, the backend system may receive second text data from the News application server. The second text data may represent the response to the sample audio: “The mayor is talking today.”

In some embodiments, the backend system may also receive instructions from the skill server. The instructions may indicate that the response to the sample audio is to be played in a specific voice type. In a non-limiting, exemplary embodiment, to enhance the experience of a sample, it may be preferable to play the sample audio request in one voice and the response in another voice. In this embodiment, the skill server may send along instructions that indicate the response is to be output using a certain voice.

At step 522, the backend system generates the second audio data representing the second text data. Step 522 may be similar to step 422 and the same description applies. Once the second text data has been received from a category server or a skills server, the second text data may be converted into audio data. The data is converted into audio data by executing TTS functionality on the second text data. The TTS functionality may be similar to Text-To-Speech 264 of FIG. 2, and the same description applies. For example, if the text data received from the News application server contains the response to the sample audio, the audio data may represent the following statement: “The mayor is talking today.” In some embodiments, if the backend system has received instructions indicating that the response to the sample audio is to be output using a specific voice type, the text data may be converted into audio data representing a statement in the specific voice type. For example, the audio data representing “The mayor is talking today,” might be cued to play in a New York accent.

At step 524, the backend system receives first display data. In some embodiments, the first display data received by the backend system will come from an application within a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. First display data, in some embodiments, may include text representing the first sample request. For example, the first display data may include the text “Alexa, tell News to tell me what is going on today.” In some embodiments, the display data may be stored locally on the electronic device. In some embodiments, step 524 may be omitted.

At step 526, the backend system receives second display data. In some embodiments, the second display data received by the backend system will come from an application within a category server or skills server. The category server or skills server may be the same as, or within, Category Servers/Skills 262 of FIG. 2 and the same description applies. Second display data, in some embodiments, may include text representing the first response. For example, the second display data may include the text “The mayor is talking today.” In some embodiments, the display data may be stored locally on the electronic device. In some embodiments, step 526 may be omitted.
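
Taken together, the display data of steps 524 and 526 might be represented by structures as simple as the following; the field name is an assumption for illustration.

```python
# Hypothetical display-data payloads for steps 524 and 526; the "text"
# field is an illustrative assumption, not a disclosed format.
first_display_data = {"text": "Alexa, tell News to tell me what is going on today."}
second_display_data = {"text": "The mayor is talking today."}
```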

At step 528, the backend system sends the first audio data to the electronic device. Step 528 may be similar to step 428 and the same description applies. The first audio data, created using TTS functionality on the first text data, is transmitted to the electronic device. Step 528 may also be similar to the transmission of audio file one 6a and the same description applies. After the first audio data is sent to the electronic device, the first audio data is output by one or more speakers of the electronic device. For example, the backend system may send the first audio data to the electronic device such that the electronic device plays “Alexa, tell News to tell me what is going on today.” In some embodiments, the electronic device may play the first audio data in a specific voice type. For example, the electronic device may play “Alexa, tell News to tell me what is going on today,” in a New York accent. Additionally, in some embodiments, the first audio data may be sent to a second electronic device such that “Alexa, tell News to tell me what is going on today,” is played by the second electronic device.
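
The transmission itself might resemble the following sketch, in which send_to_device is a hypothetical stand-in for the backend's channel to the electronic device and the message fields are assumptions.

```python
def send_to_device(message: dict) -> None:
    # Stand-in for the backend's channel to the electronic device.
    print(f"sending keys: {sorted(message)}")

def send_first_audio(first_audio_data: bytes) -> None:
    # Step 528: push the first audio data to the electronic device,
    # which then outputs it through one or more speakers.
    send_to_device({"type": "sample_audio", "payload": first_audio_data})
```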

At step 530, the backend system sends the first display data to the electronic device. Step 530 may be similar to the transmission of display data one 6b and the same description applies. After the first display data is sent to the electronic device, the first display data may be displayed by a display screen of the electronic device. The display screen in process 500 may be similar to display screen 14 and the same description applies. First display data may be displayed in a similar manner to first text invocation 12C, and the same description applies. In some embodiments, the display data is stored locally. In these embodiments, the first audio data may trigger a response in the electronic device that causes the electronic device to display the words being output by the electronic device. In some embodiments, the words being displayed may be highlighted as the audio data is being output by the electronic device. For example, when the word “Alexa” is output by the electronic device, the word “Alexa” being displayed on the electronic device may be highlighted. In some embodiments, step 530 may be omitted.
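
The highlighting behavior may be sketched as follows; play_word and highlight_word are hypothetical device-side callbacks, assumed here only to illustrate keeping the displayed words in step with the audio being output.

```python
from typing import Callable

def output_with_highlight(
    sample_text: str,
    play_word: Callable[[str], None],
    highlight_word: Callable[[int], None],
) -> None:
    # For each word of the sample, highlight the displayed word just as the
    # corresponding audio is output, e.g. "Alexa" is highlighted on screen
    # while "Alexa" is being spoken.
    for index, word in enumerate(sample_text.split()):
        highlight_word(index)
        play_word(word)
```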

At step 532, the backend system receives data from the electronic device. This data, sent from the electronic device, may indicate to the backend system that the electronic device is no longer detecting an input causing the sample to occur. In some embodiments, this data may indicate that a touch input is no longer occurring. This data may be sent just after the first audio data is output on the electronic device. In some embodiments, this data may be sent during the output of the first audio. The data may be similar to Data Indicating No Touch Input 302 and the same description applies. For example, after the electronic device outputs “Alexa, tell News to tell me what is going on today,” the electronic device may determine that a touch input is no longer being detected. If the electronic device does not detect the touch input, the electronic device may send data indicating the touch input is not occurring to the backend system.
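
On the device side, the notification might be as simple as the following sketch; notify_backend is an assumed stand-in for whatever channel the electronic device uses to reach the backend system.

```python
def notify_backend(message: dict) -> None:
    # Stand-in for the electronic device's channel to the backend system.
    print(f"sending: {message}")

def on_touch_state_change(touch_active: bool) -> None:
    # When the long press ends, tell the backend that the input causing the
    # sample is no longer detected (cf. Data Indicating No Touch Input 302).
    if not touch_active:
        notify_backend({"event": "touch_input_ended"})
```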

At step 534, the backend system determines that the electronic device is no longer detecting the input causing the sample. In some embodiments, the backend system determines that the electronic device is no longer detecting a touch input. Once the backend system receives the data from the electronic device, the backend system may recognize that the sample should stop. This may be similar to the disclosure regarding Data Indicating No Touch Input 302 and the same description applies. For example, after the electronic device outputs “Alexa, tell News to tell me what is going on today,” the electronic device sends data indicating the input is no longer occurring to the backend system. Once the data is received by the backend system, the backend system may determine that the sample should stop.

At step 536, the backend system generates stop instructions for the electronic device. The stop instructions may be for the purpose of stopping the first audio data from being played by the electronic device. The stop instructions may also direct the electronic device to stop the sample process entirely.

At step 538, the backend system sends the stop instructions to the electronic device. After generating the stop instructions, the backend system may then send the stop instructions to the electronic device, causing the electronic device to stop outputting the first audio and to not play the second audio. The transmission of the stop instructions may be similar to Stop Instructions 304 and the same description applies. The stop instructions may also cause the second audio data to not be sent to the electronic device.
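
Steps 536 and 538 together might be sketched as follows; as before, send_to_device is a hypothetical stand-in for the backend's channel to the electronic device, and the instruction fields are assumptions for illustration.

```python
def send_to_device(message: dict) -> None:
    # Stand-in for the backend's channel to the electronic device.
    print(f"sending: {message}")

def stop_sample() -> None:
    # Steps 536-538 in miniature: build stop instructions that halt the
    # first audio and suppress the second audio, then transmit them
    # (cf. Stop Instructions 304).
    stop_instructions = {
        "action": "stop_sample",
        "stop_current_audio": True,     # stop outputting the first audio
        "suppress_second_audio": True,  # do not send or play the second audio
    }
    send_to_device(stop_instructions)
```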

FIG. 6 is an illustrative diagram of an exemplary user interface showing multiple skills in accordance with various embodiments. Graphical user interface 24 may be, in some embodiments, shown on display screen 14 of electronic device 10. Displayed within graphical user interface 24 may be a list of skills. In some embodiments, skill one 702, skill two 704, skill three 706, skill four 708, skill five 710, skill six 712, skill seven 714, and skill eight 716 may be displayed on display screen 14. Each of the aforementioned skills may correspond to different skills that are capable of completing different tasks. In some embodiments, skill one 702, skill two 704, skill three 706, skill four 708, skill five 710, skill six 712, skill seven 714, and skill eight 716 may be similar to Daily Jokes 16 and the same description applies. Each skill can be individually selected and sampled. The sampling of a skill is described in more detail in FIGS. 4A, 4B, and 5 and the same descriptions apply. While eight skills are shown in FIG. 6, persons of ordinary skill in the art will recognize that any number of skills may be displayed.

The various embodiments of the invention may be implemented by software, but may also be implemented in hardware, or in a combination of hardware and software. The invention may also be embodied as computer readable code on a computer readable medium. The computer readable medium may be any data storage device that can store data which may thereafter be read by a computer system.

The above-described embodiments of the invention are presented for purposes of illustration and are not intended to be limiting. Although the subject matter has been described in language specific to structural features, it is also understood that the subject matter defined in the appended claims is not necessarily limited to the specific features described. Rather, the specific features are disclosed as illustrative forms of implementing the claims.

1-20. (canceled)
 21. A computer-implemented method, comprising: causing a graphical user interface (GUI) to be displayed on a device; receiving a user selection corresponding to an application displayed using the GUI; receiving input audio data corresponding to a command corresponding to the application; processing the input audio data to determine input text data; processing the input text data to determine intent data corresponding to the command; using the intent data to determine output text data; performing text-to-speech processing on the output text data to determine output audio data; and causing output audio data to be sent to the device.
 22. The computer-implemented method of claim 21, further comprising: determining the intent data corresponds to a first invocation of the application; sending the intent data to the application; and receiving at least a portion of the output text data from the application.
 23. The computer-implemented method of claim 21, wherein the GUI is configured to display data corresponding to a plurality of applications.
 24. The computer-implemented method of claim 21, further comprising: receiving, from the device, a first identifier corresponding to the application, wherein a representation corresponding to the first identifier was selected from a displayed set of options on the GUI.
 25. The computer-implemented method of claim 21, wherein the user selection corresponds to a touch input.
 26. The computer-implemented method of claim 21, further comprising: determining, based at least in part on the application, data corresponding to a voice type, wherein the performing the text-to-speech processing is based at least in part on the data corresponding to the voice type.
 27. The computer-implemented method of claim 21, further comprising: causing a microphone associated with the device to determine the input audio data based at least in part on audio corresponding to the command.
 28. The computer-implemented method of claim 27, wherein the causing the microphone to determine the audio is performed at least in part in response to the user selection.
 29. The computer-implemented method of claim 21, further comprising: determining a second user selection; and in response to the second user selection, ceasing certain processing with regard to the input audio data.
 30. The computer-implemented method of claim 21, wherein the processing the input text data to determine the intent data is based at least in part on the application.
 31. A system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the system to: cause a graphical user interface (GUI) to be displayed on a device; receive a user selection corresponding to an application displayed using the GUI; receive input audio data corresponding to a command corresponding to the application; process the input audio data to determine input text data; process the input text data to determine intent data corresponding to the command; use the intent data to determine output text data; perform text-to-speech processing on the output text data to determine output audio data; and cause output audio data to be sent to the device.
 32. The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine the intent data corresponds to a first invocation of the application; send the intent data to the application; and receive at least a portion of the output text data from the application.
 33. The system of claim 31, wherein the GUI is configured to display data corresponding to a plurality of applications.
 34. The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: receive, from the device, a first identifier corresponding to the application, wherein a representation corresponding to the first identifier was selected from a displayed set of options on the GUI.
 35. The system of claim 31, wherein the user selection corresponds to a touch input.
 36. The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine, based at least in part on the application, data corresponding to a voice type, wherein performing the text-to-speech processing is based at least in part on the data corresponding to the voice type.
 37. The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: cause a microphone associated with the device to determine the input audio data based at least in part on audio corresponding to the command.
 38. The system of claim 37, wherein causing the microphone to determine the audio is performed at least in part in response to the user selection.
 39. The system of claim 31, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine a second user selection; and in response to the second user selection, cease certain processing with regard to the input audio data.
 40. The system of claim 31, wherein processing the input text data to determine the intent data is based at least in part on the application. 